Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music-dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music-dance dataset and a fine-grained Music Captioner for structured music semantics. TMD-Bench further reveals that (i) modern commercial audio-visual models, such as Veo 3 and Sora 2, produce high-quality music and video but leave rhythmic coupling inconsistently optimized, with clear room for improvement, and (ii) our unified baseline RhyJAM, trained on rhythm-aligned data, achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity. These findings point toward next-generation music-dance models that explicitly optimize rhythmic and kinetic coherence.
Primary: Zhejiang University
All Institutions: Zhejiang University, Tencent, National University of Singapore
The paper presents TMD-Bench, a novel benchmark for music-dance co-generation, and introduces RhyJAM, a unified model that achieves competitive rhythmic alignment and unimodal fidelity. This work significantly advances the evaluation and generation of synchronized audio-visual content, addressing a critical gap in existing methodologies.
The paper introduces TMD-Bench, a comprehensive evaluation framework specifically designed for music-dance co-generation, which is a significant advancement in the field. The methodology is well-structured, integrating both low-level physical metrics and high-level perceptual assessments through a multi-level evaluation pipeline. The authors also develop a unified model, RhyJAM, that generates music and dance in a coherent manner, addressing the critical challenge of rhythmic alignment. The use of a fine-grained Music Captioner for structured music semantics is a novel aspect that enhances the evaluation process.
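The cross-modal rhythmic-alignment side of those low-level physical metrics is usually some variant of a beat-alignment score. The sketch below shows the common formulation from the dance-generation literature (kinematic beats taken as local minima of joint speed, each music beat scored by its Gaussian proximity to the nearest kinematic beat); it is an illustrative stand-in rather than the exact metric TMD-Bench implements, and the function names and the sigma value are assumptions.

```python
import numpy as np

def kinematic_beats(joints, fps):
    """Kinematic beats as local minima of overall joint speed (joints: T x J x 3)."""
    speed = np.linalg.norm(np.diff(joints, axis=0), axis=(1, 2)) * fps
    is_min = (speed[1:-1] < speed[:-2]) & (speed[1:-1] < speed[2:])
    return (np.where(is_min)[0] + 1) / fps            # beat times in seconds

def beat_alignment_score(music_beats, motion_beats, sigma=0.1):
    """Mean Gaussian proximity of each music beat to its nearest kinematic beat."""
    if len(music_beats) == 0 or len(motion_beats) == 0:
        return 0.0
    d = np.abs(np.asarray(music_beats)[:, None] - np.asarray(motion_beats)[None, :])
    return float(np.mean(np.exp(-d.min(axis=1) ** 2 / (2 * sigma ** 2))))
```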
The experiments are robust, utilizing a curated dataset of 10,000 rhythm-aligned music-dance pairs. The evaluation metrics are comprehensive, covering unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The results demonstrate that RhyJAM outperforms existing models in rhythmic alignment while maintaining competitive unimodal fidelity. The comparative analysis against various baselines, including closed-source and open-source models, provides a clear picture of the model's strengths and weaknesses.
The paper provides detailed implementation details, including training configurations and dataset processing methods, which enhance reproducibility. However, the absence of a publicly available demo or project URL limits the ability for others to directly replicate the results.
One limitation is the reliance on existing commercial models for baseline comparisons, which may not fully represent the capabilities of open-source alternatives. Additionally, while the evaluation framework is comprehensive, the subjective nature of some assessments may introduce variability in human judgments.
The proposed benchmark and model have the potential to significantly impact the fields of generative audio and video synthesis, particularly in applications related to virtual production and interactive media. By addressing the intricate coupling between music and dance, this work paves the way for more sophisticated generative models that can enhance user experiences in entertainment and education.
To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Center for Language and Speech Processing, Human Language Technology Center of Excellence (COE)
The main contribution of this paper is the introduction of DiffAnon, a diffusion-based voice anonymization framework that enables explicit and continuous control over prosody preservation, significantly advancing the field of privacy-preserving speech technologies. This work represents a meaningful step forward in balancing the utility-privacy trade-off in voice applications, showcasing the potential for structured prosody control in enhancing both privacy and expressiveness in anonymized speech.
The proposed methodology, DiffAnon, leverages a novel diffusion-based framework with classifier-free guidance to provide continuous control over prosody preservation in voice anonymization. This approach is innovative as it allows for the modulation of the utility-privacy trade-off in a structured manner, which is a significant advancement over existing methods that operate at fixed points. The integration of semantic embeddings from an RVQ codec with a diffusion model is particularly noteworthy, as it combines strengths from both domains to enhance the quality of anonymized speech.
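The continuous utility-privacy control rests on the standard classifier-free guidance combination of conditional and unconditional predictions. A minimal sketch is given below; the denoiser signature and argument names are hypothetical, and DiffAnon applies this idea with a prosody condition over RVQ semantic embeddings rather than the generic conditioning shown here.

```python
def cfg_step(denoiser, x_t, t, semantic_emb, prosody, w):
    """Classifier-free guidance: w = 0 ignores the prosody condition (stronger
    anonymization), larger w preserves more prosodic detail, and intermediate
    values interpolate continuously at inference time."""
    eps_uncond = denoiser(x_t, t, semantic_emb, prosody=None)
    eps_cond = denoiser(x_t, t, semantic_emb, prosody=prosody)
    return eps_uncond + w * (eps_cond - eps_uncond)
```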
The experiments are robust, utilizing the VoicePrivacy Challenge 2024 protocol, which provides a standardized framework for evaluating privacy and utility. The results demonstrate that DiffAnon achieves competitive performance across various metrics, including EER for privacy and WER for content preservation, while also showing a clear trade-off between privacy and prosodic fidelity. The systematic evaluation across different prosody guidance weights adds depth to the findings.
The authors have made their code and pretrained models publicly available, which is a strong point for reproducibility. The detailed training and inference setup, including hyperparameters and datasets used, further supports replicability of the results.
While the paper presents a significant advancement, it does not explore the potential impact of varying speaker characteristics on the performance of the model. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other languages or dialects. The paper also does not address the computational costs associated with training and deploying the model in real-world applications.
The ability to anonymize voice while preserving prosody has significant implications for privacy in various applications, including telecommunication, virtual assistants, and voice-based interactions. This work could enhance user trust in voice technologies by providing a means to protect identity while maintaining communicative effectiveness. The structured control over prosody could also lead to advancements in emotional speech synthesis and human-computer interaction.
We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present both in the training and test data, these correlations result in an overestimation of the system performance -- a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio. Better than chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
Primary: Instituto de Investigación en Ciencias de la Computación
All Institutions: Instituto de Investigación en Ciencias de la Computación, Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Facultad de Medicina, Centro de Neurociencias Cognitivas, Universidad de Chile, Universidad de San Andrés
The paper introduces a novel toolkit for detecting spurious correlations in speech datasets, addressing a critical issue in machine learning applications. The technical contributions and methodology are well-articulated, providing valuable insights into the reliability of speech-based models, particularly in high-stakes scenarios.
The methodology presented in the paper is robust and well-structured, focusing on the detection of spurious correlations in speech datasets. The authors introduce a systematic approach that leverages non-speech regions of audio to diagnose potential biases in datasets, which is a significant advancement in ensuring the reliability of machine learning models in high-stakes applications. The toolkit's design, which includes careful selection of voice-activity detection systems and feature extraction methods, demonstrates a thorough understanding of the challenges posed by spurious correlations.
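The diagnostic itself is simple to restate on any dataset: compute features only over frames a VAD marks as non-speech, then check whether a cross-validated classifier predicts the target class better than chance. The sketch below assumes the non-speech feature matrix has already been extracted; the scikit-learn usage is illustrative, and the toolkit's exact feature extractors and VAD choices are described in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def nonspeech_leakage_check(nonspeech_features, labels, n_folds=5):
    """AUC of a classifier that sees only non-speech regions; values well above
    0.5 flag spurious correlations between recording conditions and the target."""
    clf = LogisticRegression(max_iter=1000)
    aucs = cross_val_score(clf, nonspeech_features, labels, cv=n_folds, scoring="roc_auc")
    return float(np.mean(aucs)), float(np.std(aucs))
```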
The experiments conducted on two Alzheimer's disease speech datasets are comprehensive and well-executed. The authors provide a detailed analysis of the performance of their method against various configurations, including different feature extraction techniques and VAD systems. The use of statistical significance testing adds rigor to their findings, although the reliance on specific datasets may limit generalizability.
The paper offers a clear description of the experimental setup and the toolkit's implementation, which is publicly available on GitHub. This enhances reproducibility, as other researchers can apply the same methods to their datasets. However, the paper could benefit from more detailed instructions on how to utilize the toolkit effectively.
One limitation of the study is the potential overfitting to the specific datasets used for evaluation, which may not represent the broader spectrum of speech datasets. Additionally, while the toolkit addresses spurious correlations, it does not provide solutions for all possible biases that may arise in speech data collection.
The implications of this research are significant, particularly in the context of health-related machine learning applications where spurious correlations can lead to harmful consequences. The toolkit can serve as a critical resource for researchers and practitioners in the field, promoting more reliable and ethical use of speech datasets in machine learning.
Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibility into local geometric changes. We ask: how do perturbations deform local geometry, and do these shifts track downstream automatic speech recognition (ASR) degradation? To address this, we present GRIDS, a framework using Local Intrinsic Dimensionality (LID) across layer-wise representations in WavLM and wav2vec 2.0. We find that LID increases for all low signal-to-noise ratio (SNR) perturbations and diverges at high SNR: benign noise converges toward the clean profile, while adversarial inputs retain early-layer LID elevation. We show LID elevation co-occurs with increased WER, and that layer-wise LID features enable anomaly detection (AUROC 0.78-1.00), opening the door to transcript-free monitoring in S3Ms.
Primary: University of Melbourne
All Institutions: University of Melbourne, Monash University, Johns Hopkins University
The paper presents a comprehensive and innovative framework for analyzing the geometric properties of learned representations in self-supervised speech models, contributing valuable insights into the robustness of these models under various perturbations. The methodology effectively links local geometric changes to performance degradation, marking a significant advancement in the understanding of S3Ms.
The paper introduces a novel framework, GRIDS, which utilizes Local Intrinsic Dimensionality (LID) to analyze the geometric properties of learned representations in self-supervised speech models (S3Ms) under various perturbations. The methodology is well-structured, employing a layer-wise analysis that captures local changes in representation geometry, which is a significant advancement over traditional global measures. The approach effectively links geometric shifts to downstream performance metrics, specifically word error rate (WER), and provides a robust mechanism for anomaly detection without reliance on ground-truth transcripts. The use of k-nearest neighbors (kNN) for LID estimation is appropriate, although the choice of neighborhood size could introduce variability in results.
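For reference, the standard kNN maximum-likelihood LID estimator widely used in this line of work is shown below; GRIDS' exact estimator and neighborhood size may differ, and the brute-force distance computation is only for illustration.

```python
import numpy as np

def lid_mle(queries, reference, k=20):
    """Per-point LID estimate: -(1/k * sum_i log(r_i / r_k))^(-1), where r_1..r_k
    are distances to the k nearest neighbours. Assumes `reference` contains the
    query points, so the zero self-distance is dropped."""
    d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]          # k nearest non-self distances
    r_k = knn[:, -1:]                             # distance to the k-th neighbour
    return -1.0 / np.mean(np.log(np.maximum(knn, 1e-12) / r_k), axis=1)
```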
The experiments are comprehensive, utilizing a well-defined dataset (LibriSpeech) and a variety of perturbation types (benign and adversarial) under controlled signal-to-noise ratio (SNR) conditions. The results demonstrate a clear correlation between LID changes and ASR degradation, reinforcing the framework's validity. The performance metrics, including AUROC for anomaly detection, indicate strong results, particularly for the WavLM model. However, the paper could benefit from additional comparative analysis against existing methods to further establish the effectiveness of GRIDS.
The paper provides detailed descriptions of the experimental setup, including the generation of perturbations and the evaluation protocols. However, the lack of publicly available code or data limits reproducibility. Providing access to the GRIDS framework and datasets would enhance the ability of other researchers to validate and build upon these findings.
The study is limited to specific self-supervised speech models (WavLM and wav2vec 2.0) and does not explore the implications of the findings on other architectures or tasks. Additionally, the focus on layer-wise analysis may overlook global representation dynamics that could be informative. The anomaly detection performance decreases at higher SNRs, suggesting that the method may not be universally applicable across all conditions.
The findings have significant implications for the robustness and interpretability of self-supervised speech models, particularly in real-world applications where adversarial attacks and noise are prevalent. The ability to monitor representation geometry could lead to improved robustness in automatic speech recognition systems and other related fields, such as speaker verification and emotion recognition.
Speech encodes multiple simultaneous attributes--linguistic content, speaker identity, dialect, gender--that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how, or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
Primary: Department of Speech, Music & Hearing
All Institutions: Department of Speech, Music & Hearing
The main contribution of this paper is the introduction of a novel factor-partitioned embedding framework for speech that allows for controllable multi-axis similarity searches. This work represents a meaningful advancement in the field of speech representation learning, addressing the complexities of speech attributes and providing a robust solution for attribute-conditioned retrieval tasks.
The paper introduces a factor-partitioned embedding framework that effectively separates various attributes of speech into distinct subspaces, allowing for nuanced similarity searches. The methodology is well-structured, employing a shared acoustic encoder and per-axis linear projection heads trained through distillation and contrastive objectives. This approach is innovative in its use of signed axis weighting to control retrieval outcomes, providing a significant advancement over traditional single-vector embeddings that conflate multiple attributes.
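The retrieval rule is just a signed, weighted sum of cosine similarities computed per subspace. A minimal sketch follows; the subspace layout and the example weights are illustrative and not the paper's trained configuration.

```python
import numpy as np

def axis_similarity(query, gallery, axis_slices, weights):
    """Signed weighted sum of per-axis cosine scores between one query embedding
    and every gallery embedding; a negative weight suppresses that axis."""
    scores = np.zeros(len(gallery))
    for axis, (start, stop) in axis_slices.items():
        q, g = query[start:stop], gallery[:, start:stop]
        cos = g @ q / (np.linalg.norm(g, axis=1) * np.linalg.norm(q) + 1e-8)
        scores += weights.get(axis, 0.0) * cos
    return scores

# e.g. surface content matches while suppressing same-speaker hits:
# weights = {"content": 1.0, "speaker": -0.5, "dialect": 0.0, "gender": 0.0}
```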
The experiments conducted are thorough, utilizing cross-corpus retrieval tasks with well-defined metrics such as Precision@k and preference-flip tests. The evaluation on datasets sharing the Harvard sentence prompts is appropriate, demonstrating the framework's ability to suppress same-speaker bias and surface semantically matched utterances. The results are compelling, showing that the proposed method outperforms baseline models significantly.
The paper provides a clear description of the architecture, data sources, and training processes, which enhances reproducibility. However, the absence of a publicly available implementation or code repository limits the ability for other researchers to replicate the results fully.
One limitation identified is the reliance on specific datasets, which may not generalize well to other speech domains. Additionally, the paper notes challenges in separating the gender axis due to its correlation with speaker identity, suggesting that further exploration is needed to improve this aspect. The potential for mode collapse in models trained without auxiliary tasks is also a concern.
The proposed framework has significant implications for applications in speech retrieval, speaker recognition, and potentially in areas such as voice conversion and personalized speech synthesis. By enabling controllable retrieval based on multiple axes, it opens avenues for more sophisticated user interactions with speech technologies.
Vocal hyperfunction (VH) is a prevalent voice disorder whose ambulatory detection remains challenging despite extensive daily voice data. Prior approaches capture week-long neck-surface accelerometer recordings but collapse them into fixed-length subject-level feature vectors, discarding within-day temporal dynamics encoding nuanced voicing feature interactions. We introduce a novel hybrid architecture combining gradient-boosted trees on day-level distributional features with a CNN-based multiple instance learning (MIL) framework that preserves and learns from temporal dynamics throughout each day. On the held-out test set, our model exceeds the challenge baselines (AUC 0.82 for PVH, 0.77 for NPVH), achieving AUCs of 0.879 for PVH (Rank 5) and 0.848 for NPVH (Rank 3), while also providing insights into clinically relevant information about both pathologies.
Primary: Harvard University
All Institutions: Harvard University, Eaton Peabody Laboratories, Massachusetts Eye and Ear Infirmary
The paper presents a novel hybrid architecture for detecting vocal hyperfunction using attention-based multiple instance learning, significantly advancing the state of the art in ambulatory voice monitoring. The methodology effectively addresses the limitations of previous approaches by preserving temporal dynamics, leading to improved diagnostic accuracy and insights into voice disorders.
The paper introduces a hybrid architecture that effectively combines gradient-boosted trees with a CNN-based multiple instance learning (MIL) framework, which is innovative in the context of ecological momentary assessment for voice disorders. The dual-representation framework allows the model to leverage both global distributional features and local temporal dynamics, addressing the shortcomings of previous methods that relied on fixed-length feature vectors. The attention mechanism in the CNN-MIL architecture is particularly noteworthy, as it enables the model to learn which time segments are most discriminative for vocal hyperfunction detection.
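The attention mechanism in the CNN-MIL branch follows the familiar attention-pooling recipe for multiple instance learning: segment embeddings from one day are weighted by learned attention and summed into a bag representation that feeds a day-level classifier. The sketch below follows the standard attention-MIL formulation and is not the paper's exact architecture; the CNN that produces the segment embeddings is omitted.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Attention pooling over one day's time segments (instances)."""
    def __init__(self, d_in, d_attn=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(d_in, d_attn), nn.Tanh(), nn.Linear(d_attn, 1))
        self.head = nn.Linear(d_in, 1)

    def forward(self, instances):                       # (n_segments, d_in)
        a = torch.softmax(self.attn(instances), dim=0)  # attention over segments
        bag = (a * instances).sum(dim=0)                # (d_in,) bag embedding
        return self.head(bag), a                        # day-level logit + weights
```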
The experiments are robust, utilizing the NeckVibe Challenge dataset, which is the largest publicly available dataset for this task. The reported AUC scores demonstrate significant improvements over existing baselines, particularly for the more challenging non-structural vocal hyperfunction (NPVH) classification. The systematic ablation studies provide strong evidence for the contributions of each model component, confirming the effectiveness of the proposed ensemble strategy.
The paper provides detailed implementation information, including hyperparameters and software versions used, which enhances reproducibility. However, the absence of a publicly accessible code repository limits the ability for other researchers to replicate the results fully.
The study acknowledges that the CNN-MIL framework processes each day independently, which may overlook week-level trends in vocal behavior. Additionally, while the attention mechanism offers some interpretability, the complexity of the model may hinder full understanding of the feature interactions. Future work could explore hierarchical models and causal inference approaches.
The findings have significant implications for the field of voice disorder diagnostics, particularly in enhancing ambulatory monitoring techniques. By capturing temporal dynamics, the proposed approach could lead to more accurate and timely interventions for individuals suffering from vocal hyperfunction, potentially improving patient outcomes.
Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-model-based Tibetan TTS system in the industry, built upon a large speech synthesis model developed by Xingchen AGI Lab. The proposed system integrates data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource Tibetan speech synthesis. Experimental results show that the system can generate stable, natural, and intelligible Tibetan speech under low-resource conditions. In subjective evaluation, the MOS scores of the syllable-level and BPE-based systems reach 4.28 and 4.35, while their pronunciation accuracies reach 97.6% and 96.6%, respectively, outperforming an external commercial Tibetan TTS interface. These results demonstrate that combining a large-model backbone with Tibetan-oriented text representation adaptation and cross-lingual adaptive training enables highly usable low-resource Tibetan speech synthesis, and also provides a technical foundation for future unified multi-dialect Tibetan speech synthesis.
Primary: Xingchen AGI Lab
All Institutions: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology Co. Ltd, Xizang University, Qinghai Normal University, University of Electronic Science and Technology of China
This paper establishes a pioneering framework for low-resource Tibetan speech synthesis, combining innovative methodologies with practical applications. The integration of data quality enhancement, tailored text representation, and cross-lingual training marks a significant advancement in the field of speech synthesis, particularly for minority languages facing resource constraints.
The paper presents a comprehensive approach to Tibetan TTS by integrating a data quality enhancement pipeline, a Tibetan-oriented text representation, and a cross-lingual adaptive training strategy. The methodology is well-structured, addressing the unique challenges of Tibetan speech synthesis, such as dialectal variation and resource scarcity. The use of a large-model backbone and the adaptation of tokenization strategies specifically for Tibetan linguistic characteristics are notable innovations that enhance the model's performance under low-resource conditions.
The experiments are robust, with both subjective (MOS scores) and objective (pronunciation accuracy) evaluations demonstrating the effectiveness of the proposed system. The reported MOS scores of 4.28 and 4.35 for the syllable-level and BPE-based systems, respectively, indicate a high level of naturalness and intelligibility. The comparison with an external commercial TTS interface further validates the system's performance, showcasing its potential for practical applications.
The paper lacks detailed implementation specifics, such as code availability or dataset access, which could hinder reproducibility. While the methodology is clearly described, the absence of a public repository or demo limits the ability for other researchers to replicate the results.
The study primarily focuses on the U-Tsang dialect, which may limit the generalizability of the findings to other Tibetan dialects. Additionally, while the proposed methods show promise, the reliance on a large pretrained model may not be feasible in all low-resource scenarios, particularly where such models are not available.
The development of a Tibetan TTS system has significant implications for cultural preservation, education, and accessibility in Tibetan-speaking regions. By providing a framework for low-resource language synthesis, this work could serve as a model for similar efforts in other underrepresented languages, promoting linguistic diversity in technology.
Periodic patterns are fundamental cues in multimedia signals and systems, including repetitive motion in video (e.g., gait cycles), rhythmic and pitch-related structure in audio, and recurring textures in image sequences. When such user-generated streams are collected from edge devices, local differential privacy (LDP) is appealing because it perturbs data before upload; however, the injected noise can corrupt spectral peaks and induce phase drift, making period estimation unreliable and degrading reconstruction quality. We propose CPR (Cycle and Phase Recovery), a period-aware reconstruction framework for periodic time series under LDP. CPR performs multi-scale period probing and multi-consensus selection to suppress noise-induced spectral interference, then aggregates perturbed samples at matched within-cycle phase positions to stabilize phase alignment across cycles. To recover the underlying per-phase values, CPR combines EM-based denoising with kernel density estimation, improving robustness under tight privacy budgets. Experiments on two real-world periodic datasets demonstrate that CPR better preserves periodic structure and consistently achieves lower reconstruction error than representative LDP baselines, especially in the low-$ε$ regime.
Primary: Taiyuan University of Technology
All Institutions: Taiyuan University of Technology, University of Michigan, Ann Arbor
The main contribution of this paper is the development of the CPR framework, which effectively addresses the challenges of reconstructing periodic time series under local differential privacy. This work significantly advances the field by providing a robust solution that preserves critical periodic structures while ensuring privacy, thus opening new avenues for research and application in multimedia signal processing.
The proposed CPR framework introduces a novel approach to reconstruct periodic time series under local differential privacy (LDP) by addressing the specific challenges posed by noise-induced spectral interference and phase drift. The methodology is robust, employing multi-scale period probing and phase-aware aggregation, which are well-justified in the context of multimedia signals. The integration of EM-based denoising with kernel density estimation is a significant technical contribution, enhancing the reconstruction quality while maintaining privacy. The paper effectively articulates the rationale behind each methodological choice and demonstrates a clear understanding of the underlying challenges.
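The phase-alignment step can be pictured with a very small sketch: once a period estimate is fixed, every perturbed sample is mapped to a within-cycle phase bin and samples sharing a bin are averaged, which is what lets per-sample LDP noise cancel. CPR's actual pipeline adds multi-scale period probing, multi-consensus selection, EM-based denoising, and KDE; the code below only illustrates the folding idea and assumes an integer period with n_bins <= period and at least one full cycle of data.

```python
import numpy as np

def phase_binned_profile(perturbed, period, n_bins):
    """Fold a perturbed series onto its period and average within each phase bin."""
    t = np.arange(len(perturbed))
    bins = ((t % period) * n_bins // period).astype(int)   # phase bin per sample
    profile = np.array([perturbed[bins == b].mean() for b in range(n_bins)])
    return profile, bins
```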
The experiments conducted on real-world datasets are comprehensive and well-structured, showcasing the effectiveness of CPR compared to several baseline methods. The evaluation metrics, particularly the cosine distance for reconstruction accuracy, are appropriate for the task. The results convincingly demonstrate that CPR outperforms existing methods, especially under tight privacy constraints, which is crucial for practical applications. However, the paper could benefit from a more detailed discussion of the experimental setup, including hyperparameter choices and potential variations in results across different datasets.
The paper provides a clear description of the experimental setup and methodologies used, but it lacks specific implementation details that would facilitate reproducibility. Key parameters and configurations for the experiments are mentioned, yet the absence of a publicly available code repository limits the ability for others to replicate the results. Including a link to a GitHub repository or similar would greatly enhance reproducibility.
One limitation of the study is the reliance on specific datasets that may not fully represent the diversity of periodic signals encountered in real-world applications. Additionally, while the proposed method shows significant improvements, the paper does not address the computational complexity of the CPR framework, which may affect its applicability in resource-constrained environments. The potential trade-offs between privacy and reconstruction accuracy under varying conditions could also be explored further.
The implications of this research are significant, particularly in fields where privacy-preserving data collection is paramount, such as healthcare and personal monitoring systems. By enabling accurate reconstruction of periodic signals while adhering to strict privacy constraints, this work has the potential to enhance the usability of sensitive data in various applications, including motion analysis and behavioral monitoring. The findings may also inspire further research into privacy-aware signal processing techniques.
We study example-level private supervised speech classification under a practical release constraint: training may access privileged side information, but the released model must be audio-only. This setting is important because speech systems can often exploit richer side information during development, whereas deployment and release require a lightweight unimodal model with auditable privacy guarantees. Using DP-SGD on the private dataset $D_{\text{priv}}$, we identify a strong-privacy failure mode ($ε\le 1$) on imbalanced tasks, where training may collapse to a near single-class predictor, a phenomenon that overall accuracy can obscure. We therefore emphasize Macro-F1, balanced accuracy, and a simple collapse diagnostic. This failure is especially problematic in our release setting because a collapsed private teacher cannot provide useful supervision for the downstream audio-only student. To address this setting under strong privacy, we propose a two-stage protocol: (i) train a (possibly multimodal) DP teacher on $D_{\text{priv}}$, and (ii) distill an audio-only student on a fixed, recording-disjoint auxiliary dataset $D_{\text{aux}}$ using one-shot offline teacher probability outputs, releasing only the student. The DP guarantee applies only to $D_{\text{priv}}$; we make no DP claim for $D_{\text{aux}}$, and privacy of the released student with respect to $D_{\text{priv}}$ follows by post-processing. We frame this setting as involving four coupled bottlenecks: speech-induced optimization instability under DP-SGD, minority-class erosion under clipping and noise, teacher over-reliance on privileged modalities unavailable at deployment, and train--deploy modality mismatch. We address them with a DP-stabilizing acoustic front-end (DSAF), minibatch-adaptive bounded loss reweighting (AW-DP), privileged-modality dropout, and offline teacher-to-student distillation.
Primary: Taiyuan University of Technology
All Institutions: Shanxi Key Laboratory of Industrial Internet Security, University of Michigan, Taiyuan University of Technology
The main contribution of this paper is a novel two-stage protocol for private speech classification that mitigates prediction collapse and class imbalance while ensuring differential privacy. This work significantly advances the field of privacy-preserving machine learning by providing effective solutions to critical challenges in deploying speech classification systems.
The paper presents a novel two-stage protocol for private speech classification that effectively addresses the challenges of differential privacy (DP) in imbalanced datasets. The methodology is well-structured, incorporating a DP teacher trained on private data followed by offline distillation to create an audio-only student model. The authors introduce several innovative techniques such as the DP-Stabilizing Acoustic Front-End (DSAF) and Imbalance-aware Weighted DP-SGD (AW-DP) to mitigate issues related to prediction collapse and class imbalance. The use of privileged-modality dropout further enhances the robustness of the model by discouraging reliance on privileged information during deployment. Overall, the methodology is comprehensive and addresses critical bottlenecks in private speech classification.
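The release-time privacy argument is worth making concrete: the DP budget is spent once when the teacher is trained on $D_{\text{priv}}$, the teacher's class probabilities on $D_{\text{aux}}$ are computed one time offline, and the audio-only student is then fit to those fixed probabilities, so the released student is post-processing of the DP teacher. A hedged sketch of the distillation step is below; the loss form, model names, and any temperature handling are illustrative rather than the paper's exact training recipe.

```python
import torch.nn.functional as F

def distill_step(student, optimizer, audio_batch, teacher_probs):
    """One offline-distillation update: the audio-only student matches the
    precomputed (fixed) DP-teacher probabilities on the auxiliary set."""
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(audio_batch), dim=-1)
    loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return float(loss.item())
```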
The experiments are thorough, utilizing the Mozilla Common Voice dataset to evaluate the proposed methods. The authors provide a clear comparison of their two-stage distillation approach against a single-stage DP audio-only model. The results demonstrate significant improvements in Macro-F1 and balanced accuracy metrics, particularly under strong privacy constraints, highlighting the effectiveness of their approach. The inclusion of various metrics to diagnose collapse, such as Maj-Pred, adds depth to the evaluation. However, further exploration of the impact of different auxiliary dataset sizes could enhance the robustness of the findings.
The paper provides sufficient details regarding the experimental setup, including the use of Python and PyTorch for implementation. However, the absence of a public code repository or demo URL limits reproducibility. Future work should consider making the code available to facilitate validation of results and encourage further research in this area.
One limitation of the study is the lack of a comprehensive evaluation of the model's performance across diverse datasets beyond the Mozilla Common Voice dataset. Additionally, while the paper addresses the issue of prediction collapse, it does not fully explore the implications of using auxiliary datasets that may not be representative of the deployment environment. The privacy guarantee for the auxiliary dataset is also not claimed, which could raise concerns about potential data leakage.
The proposed methods have significant implications for the deployment of speech classification systems in privacy-sensitive applications, such as voice assistants and transcription services. By ensuring that models can be trained with rich side information while maintaining privacy in the released artifacts, this work contributes to the development of more secure and trustworthy AI systems. The approach could be extended to other modalities and applications, enhancing the overall impact on the field of machine learning and privacy-preserving technologies.
We present the Streaming Reservoir Convergence Theorem (SRCT), a novel mathematical framework for multi-provider adaptive bitrate streaming that addresses three fundamental structural weaknesses in current systems: linear provider probing, reactive failover, and cold standby transitions. SRCT models stream acquisition as a concurrent reservoir filling problem, probing all $N$ providers simultaneously rather than in batches, and maintains $k$ pre-verified, pre-fetched standby streams alongside the active stream to enable sub-second failover with zero user-visible disruption. We prove four principal results: (1) a harmonic lower bound on reservoir safety showing that $k$ independent streams provide $H_k / \bar{λ}$ expected uptime where $H_k$ is the $k$-th harmonic number; (2) a concurrent acquisition speedup $S(N,b) = (N/b) \cdot (1-F^b)/(1-F^N)$ over batched probing, yielding $3$-$5\times$ practical improvement; (3) monotonic non-decreasing quality under lazy-refill with convergence to the Pareto-optimal frontier; and (4) a prospect-weighted switching rule, using Kahneman-Tversky value functions with $α=β=0.88$ and $λ=2.25$, that provably eliminates thrashing between similar-quality streams via a no-thrash bound on the expected switch count. We implement SRCT across two production streaming pipelines: a primary movie/TV system serving 12+ HLS providers with $k=3$ reservoir slots, and a live sports system with multi-format DASH/HLS failover. Empirical verification via Monte Carlo simulation (5000 trials) confirms all four theorems across 22 independent checks. The reservoir of $k=3$ streams achieves $9.15\times$ mean time to depletion versus a single stream, and concurrent probing of 12 providers at 40% failure rate yields a $4.27\times$ speedup over the current batched-by-3 default.
Primary: Sperix Labs
All Institutions: Sperix Labs, KNUST
The main contribution of this paper is the introduction of the Streaming Reservoir Convergence Theorem, which provides a comprehensive framework for improving multi-provider adaptive streaming. The technical contributions, including theoretical proofs and empirical validation, position this work as a notable advancement in the field of audio streaming and adaptive bitrate technologies.
The paper introduces the Streaming Reservoir Convergence Theorem (SRCT), which presents a novel mathematical framework for adaptive bitrate streaming across multiple providers. The methodology is robust, employing a combination of theoretical proofs and practical implementations. The authors effectively unify several aspects of streaming (provider probing, failover, and quality selection) into a single reservoir model, which is a significant advancement over traditional methods. The use of prospect theory to inform the switching rules adds a unique psychological dimension to the algorithm, enhancing its applicability in real-world scenarios.
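The three closed-form quantities in the abstract are easy to evaluate directly, which also makes the Monte Carlo verification easy to sanity-check. The snippet below restates the harmonic reservoir bound $H_k / \bar{λ}$, the concurrent-probing speedup $S(N,b)$, and the Kahneman-Tversky value function with the stated parameters; at $N=12$, $b=3$, $F=0.4$ the analytic speedup is about $3.7\times$, consistent with the claimed $3$-$5\times$ practical range. This is a worked illustration of the published formulas, not the authors' implementation.

```python
def harmonic(k):                        # H_k = 1 + 1/2 + ... + 1/k
    return sum(1.0 / i for i in range(1, k + 1))

def expected_uptime(k, lam_bar):        # Theorem 1: k standbys give H_k / lambda_bar
    return harmonic(k) / lam_bar

def concurrent_speedup(N, b, F):        # Theorem 2: (N/b) * (1 - F^b) / (1 - F^N)
    return (N / b) * (1 - F ** b) / (1 - F ** N)

def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Kahneman-Tversky value function used to weight quality gains vs. losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta

print(harmonic(3))                      # H_3 ~= 1.833
print(concurrent_speedup(12, 3, 0.4))   # ~= 3.74, within the claimed 3-5x range
```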
The experimental evaluation is thorough, with empirical verification conducted through Monte Carlo simulations and deterministic checks. The results demonstrate significant improvements in mean time to depletion and speedup in stream acquisition, validating the theoretical claims made in the paper. The experiments are well-designed, with a clear connection between the theoretical framework and practical outcomes.
While the paper provides a detailed description of the algorithm and its implementation, it lacks a publicly accessible code repository or demo URL, which would enhance reproducibility. The absence of shared code limits the ability of other researchers to validate the findings independently.
The paper acknowledges the limitations of the conditional independence assumption, which may not hold during large-scale outages. Additionally, the reliance on a Markov model for stream viability may not capture more complex, time-varying availability patterns. The parameters derived from prospect theory are also noted as potentially needing further calibration for specific streaming contexts.
The proposed framework has significant implications for the future of adaptive streaming technologies, particularly in enhancing user experience through reduced buffering and improved quality. The integration of psychological principles into automated decision-making processes may influence other domains beyond streaming, such as network management and real-time systems.
Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
Primary: Posts and Telecommunications Institute of Technology
All Institutions: Posts and Telecommunications Institute of Technology
The paper presents a comprehensive approach to addressing the challenges of multi-region speech inpainting detection, contributing valuable resources and methodologies to the field of audio forensics. The introduction of the MIST dataset and the ISA framework represents a meaningful step forward in the ongoing battle against audio deepfakes and misinformation.
The paper introduces a novel dataset (MIST) designed specifically for multi-region speech inpainting detection, addressing significant gaps in existing benchmarks that focus primarily on single-region tampering. The methodology includes a comprehensive generation pipeline for creating the dataset, which utilizes LLM-guided semantic replacements and neural voice cloning to produce high-quality tampered audio. The proposed Iterative Segment Analysis (ISA) framework is robust and backbone-agnostic, allowing for effective localization of tampered segments without prior knowledge of their count. The introduction of the SF1@tau metric is a significant advancement, providing a more nuanced evaluation of detection performance by accounting for both segment count and localization precision.
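The SF1@tau metric is straightforward to restate in code: predicted and reference segments are matched one-to-one when their temporal IoU reaches tau, and F1 is computed over the matches. The greedy matching below is a sketch under that reading; the paper's evaluation toolkit defines the exact matching procedure.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def sf1_at_tau(pred, gt, tau=0.5):
    """Segment-level F1 with greedy one-to-one matching at IoU >= tau."""
    matched, tp = set(), 0
    for p in pred:
        best, best_iou = None, tau
        for j, g in enumerate(gt):
            iou = temporal_iou(p, g)
            if j not in matched and iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched.add(best)
            tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
```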
The experiments conducted are thorough, employing a zero-shot evaluation strategy that highlights the challenges of detecting partial inpainting. The results demonstrate that ISA outperforms baseline methods in terms of localization accuracy, even when using a classifier not specifically trained on the dataset. The paper includes a detailed breakdown of results by language and variant, providing insights into the performance across different linguistic contexts and manipulation types. However, the overall SF1 scores remain low, indicating that the task is inherently challenging and that further work is needed to improve detection capabilities.
The authors provide sufficient details regarding the dataset generation process, the ISA framework, and the evaluation metrics, which supports reproducibility. The dataset and code are publicly available, which is a positive aspect for the research community. However, the implementation details of the ISA framework could benefit from additional clarity regarding hyperparameter tuning and specific configurations used in experiments.
The paper acknowledges the limitations of existing audio deepfake detection systems, particularly their inability to handle partial inpainting effectively. The ISA method, while innovative, still relies on a backbone classifier that was not trained on the specific task of multi-region tampering, which may limit its performance. Additionally, the low absolute scores in the experiments suggest that the problem remains challenging, and the proposed methods may require further refinement and optimization.
The implications of this research are significant, particularly in the context of audio forensics and misinformation detection. As voice cloning technology advances, the ability to detect and localize tampered speech becomes increasingly critical for maintaining trust in audio communications. The MIST dataset and ISA framework can serve as foundational tools for future research in this area, potentially leading to improved detection methods and better understanding of audio manipulation threats.
Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also needs sufficient cross-layer and cross-modal support before it can reliably guide later fusion. This paper studies this issue through propagation-aware representation readiness and formulates premature perceptual commitment as a readiness-deficiency problem, where local plausibility, propagation influence, and support insufficiency jointly appear at an intermediate stage. We propose the Delayed Perceptual Commitment Network (DPC-Net), an encoder-level framework that estimates an observable readiness-deficiency surrogate, localizes the intervention-sensitive bottleneck, and applies support-aware correction with cross-layer and cross-modal evidence. DPC-Net preserves task-specific heads, losses, decoding modules, and evaluation protocols, making it applicable to different audio-visual tasks through encoder-side intervention. Experiments on audio-visual speech separation, audio-visual event localization, and audio-visual speech recognition show consistent improvements across reconstruction, localization, and recognition regimes. Further analyses on component contribution, selection criteria, counterfactual intervention, and readiness trajectories support the effectiveness of readiness-guided bottleneck correction.
Primary: Lingnan University
All Institutions: Lingnan University, University of Southern Queensland, Wuhan University of Technology, Hong Kong Metropolitan University
The main contribution of this work is the introduction of DPC-Net, a novel framework that enhances representation readiness in audio-visual learning by addressing premature perceptual commitment through a readiness-deficiency approach. This research significantly advances the understanding of audio-visual representation learning, providing a robust mechanism to improve performance across various tasks while preserving the integrity of task-specific architectures.
The paper introduces the Delayed Perceptual Commitment Network (DPC-Net), which innovatively addresses the issue of representation readiness in stage-wise audio-visual learning by formulating premature perceptual commitment as a readiness-deficiency problem. The methodology is well-structured, utilizing a readiness-deficiency surrogate to localize bottlenecks and applying support-aware corrections, thus enhancing the robustness of audio-visual representations. The approach is grounded in theoretical insights from human perception, which strengthens its conceptual foundation.
The experiments are comprehensive, covering three distinct audio-visual tasks: speech separation, event localization, and speech recognition. The results demonstrate consistent improvements across various metrics, indicating the effectiveness of the proposed method. The use of controlled comparisons with baseline models adds rigor to the evaluation, although the paper could benefit from additional qualitative assessments of the generated outputs.
The paper outlines the implementation details, including the architecture and training protocols, which aids reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the findings fully.
One limitation is the reliance on specific datasets for evaluation, which may not generalize across all audio-visual tasks. Additionally, the paper does not address potential computational overhead introduced by the DPC-Net framework, which could impact its practical deployment in real-time systems.
The proposed framework has significant implications for improving audio-visual learning systems, particularly in applications requiring robust performance under adverse conditions, such as speech recognition in noisy environments. The insights gained from this research could lead to advancements in multimodal AI systems, enhancing their ability to process and integrate diverse sensory inputs effectively.
A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we explore an alternative view: both may be progressively modeled within a single deep acoustic-token hierarchy. To study this, we build a 64-layer residual vector quantization (RVQ) acoustic representation and propose a two-stage coarse-to-fine generation framework. A backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then completes finer tokens within the same acoustic token space. The super-resolution stage works at full-track scale and refines tokens layer by layer while running in parallel over time, leading to a fixed 62-step inference process. To jointly improve lyric alignment and fine-detail reconstruction, we further introduce hybrid-attention training: the alignment objective uses causal attention, while layer-wise refinement uses full attention. A key finding is that text--vocal alignment can emerge within pure acoustic-token language modeling, without requiring a separate semantic token stage. Moreover, initializing the super-resolution model from the trained backbone significantly improves convergence and final quality. Taken together, our results suggest that high-quality music generation can be effectively pursued without separating structure and fidelity into heterogeneous representation spaces. Instead, both can be progressively modeled within a unified acoustic-token hierarchy, pointing toward a simpler and more unified path to high-quality music generation.
Primary: Central Conservatory of Music
All Institutions: Central Conservatory of Music, Tsinghua University
The main contribution of this paper is the introduction of Khala, a high-fidelity music generation system that effectively models both musical structure and acoustic fidelity within a unified acoustic-token hierarchy, demonstrating competitive performance against existing systems. This work significantly advances the field of music generation by providing a novel methodology that integrates lyric alignment and acoustic detail refinement in a single framework, showcasing the potential for future developments in this area.
The paper presents a novel approach to music generation by using a unified acoustic-token hierarchy, which contrasts with existing methods that typically separate structure and fidelity into different stages. The introduction of a two-stage coarse-to-fine generation framework and hybrid-attention training for lyric alignment is particularly innovative. The methodology is well-structured, with clear explanations of the architecture and training processes, including the use of a 64-layer residual vector quantization (RVQ) acoustic representation. The authors effectively address the challenges of generating coherent and high-fidelity music while maintaining lyric alignment, which is a significant advancement in the field.
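To make the hybrid-attention idea concrete, the sketch below constructs the two additive attention masks such a scheme would alternate between: a causal mask standing in for the lyric-alignment objective and a full mask standing in for layer-wise refinement over an already generated track. This is a minimal illustration with placeholder tensor shapes, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_attention_mask(seq_len: int, mode: str) -> torch.Tensor:
    """Additive attention mask of shape (seq_len, seq_len).

    mode="causal": each position attends only to itself and earlier positions
                   (a stand-in for the alignment objective).
    mode="full":   every position attends to every other position
                   (a stand-in for layer-wise token refinement).
    """
    if mode == "causal":
        # -inf above the diagonal masks out future tokens.
        return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    if mode == "full":
        return torch.zeros(seq_len, seq_len)
    raise ValueError(f"unknown mode: {mode}")

# Toy usage with PyTorch's built-in attention kernel.
q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, time, head_dim)
causal_out = F.scaled_dot_product_attention(q, k, v, attn_mask=build_attention_mask(16, "causal"))
full_out = F.scaled_dot_product_attention(q, k, v, attn_mask=build_attention_mask(16, "full"))
```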
The experiments conducted in the paper include a large-scale human preference evaluation, which is crucial for assessing the quality of generated music. The use of both subjective (mean Overall Score and BT-derived Elo) and objective metrics provides a comprehensive evaluation framework. The results indicate that Khala performs competitively against both commercial and open-source systems, showcasing its effectiveness in real-world applications. The ablation studies further validate the importance of the proposed training strategies, particularly the backbone initialization and hybrid-attention training.
The paper provides detailed implementation information, including model architectures, training strategies, and dataset descriptions. The availability of code and model checkpoints on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work. However, the paper could benefit from more explicit details on hyperparameter settings and training configurations to facilitate easier reproduction.
One limitation noted is the reliance on a two-stage model design, which, while practical, may not fully exploit the potential of a unified model that integrates both coarse generation and fine-layer refinement. Additionally, the paper acknowledges that the current tokenizer, while effective, could be improved for even higher fidelity, suggesting that future work could focus on enhancing the acoustic representation further.
The work has significant implications for the field of music generation, particularly in developing systems that can produce high-quality music with coherent structure and fidelity without relying on separate semantic stages. This approach could pave the way for more integrated and efficient music generation systems, potentially impacting various applications in entertainment, education, and creative industries. The findings also suggest a promising direction for future research in audio modeling and machine learning.
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying vocoder, such as DiffWave or HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions, e.g., compression and additive noise, while preserving high perceptual audio quality.
Primary: Queen's University
All Institutions: Queen's University, University of Waterloo
MelShield presents a novel in-generation audio watermarking framework that effectively integrates into TTS systems, enhancing copyright protection and attribution mechanisms. The comprehensive evaluation and innovative methodology position this work as a significant contribution to the field of audio processing and machine learning.
The methodology presented in MelShield is innovative, leveraging a keyed spread-spectrum approach for watermarking directly in the Mel-spectrogram domain of TTS systems. This is a significant advancement over traditional post-hoc watermarking methods, as it integrates watermarking seamlessly into the audio generation pipeline without requiring modifications to existing vocoders. The use of low-energy perturbations and adaptive masking to maintain audio quality while embedding watermarks is particularly noteworthy. The authors provide a clear and systematic approach to embedding and extracting watermarks, which is well-justified and theoretically sound.
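The keyed spread-spectrum idea can be illustrated in a few lines of NumPy: derive pseudo-random chip sequences from a secret key, add a low-energy payload to the Mel-spectrogram, and recover bits by correlation. This is a simplified sketch that ignores the adaptive time-frequency region selection and perceptual masking described in the paper; all function names and parameters are illustrative.

```python
import numpy as np

def keyed_chips(key: int, n_bits: int, chip_len: int) -> np.ndarray:
    """Pseudo-random +/-1 chip sequences derived from a secret key."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=(n_bits, chip_len))

def embed(mel: np.ndarray, bits: np.ndarray, key: int, alpha: float = 0.01) -> np.ndarray:
    """Add a low-energy spread-spectrum payload to a (freq, time) Mel-spectrogram."""
    chip_len = mel.size // len(bits)
    chips = keyed_chips(key, len(bits), chip_len)
    host = mel.flatten().copy()
    for i, b in enumerate(bits):
        seg = slice(i * chip_len, (i + 1) * chip_len)
        host[seg] += alpha * (1.0 if b else -1.0) * chips[i]
    return host.reshape(mel.shape)

def extract(mel_wm: np.ndarray, n_bits: int, key: int) -> np.ndarray:
    """Recover bits by correlating each segment with its keyed chip sequence."""
    chip_len = mel_wm.size // n_bits
    chips = keyed_chips(key, n_bits, chip_len)
    host = mel_wm.flatten()
    bits = []
    for i in range(n_bits):
        seg = host[i * chip_len:(i + 1) * chip_len]
        bits.append(int(np.dot(seg, chips[i]) > 0))
    return np.array(bits)
```

Without the shared key, the chip sequences cannot be regenerated, which is what limits unauthorized decoding in this style of construction.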
The experimental evaluation is comprehensive, utilizing two prominent TTS vocoders (DiffWave and HiFi-GAN) and a robust dataset (LJSpeech 1.1). The results demonstrate high bit accuracy for watermark recovery under various conditions, including common signal distortions. The paper effectively compares MelShield against existing watermarking methods, showcasing its superior performance in terms of robustness and fidelity. The use of multiple evaluation metrics (PESQ, STOI, DNSMOS) adds credibility to the results, although the paper could benefit from more extensive user studies to assess perceptual quality in real-world scenarios.
The paper provides a detailed description of the experimental setup, including the datasets, vocoder configurations, and watermark embedding parameters. However, it lacks a publicly accessible code repository or demo URL, which would enhance reproducibility and allow other researchers to validate the findings. Clearer documentation of the implementation would also aid in replicating the experiments.
One limitation is the reliance on specific vocoders, which may not generalize to all TTS systems. While the authors claim model-agnostic deployment, the performance may vary with different architectures not tested in the study. Additionally, the paper does not address potential vulnerabilities to advanced adversarial attacks that could target the watermarking system. The scalability of the approach in high-demand real-world applications remains to be fully explored.
The implications of this work are significant, particularly in the context of copyright protection and attribution for AI-generated audio. As deepfake technologies become more prevalent, robust watermarking solutions like MelShield can help mitigate risks associated with misinformation and unauthorized content distribution. The framework could be applied across various domains, including media production, digital rights management, and content verification systems.
Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.
Primary: Beijing Jiaotong University
All Institutions: Beijing Jiaotong University, Malou Tech Inc, South-Central Minzu University, Fudan University, Renmin University of China
This paper presents a significant advancement in the field of music-driven motion synthesis through the introduction of a Transformer-based framework for generating conducting gestures. The methodology effectively combines detailed pose representation with a novel evaluation approach, setting a new standard for future research in this area.
The proposed methodology introduces a novel Transformer-based framework, TransConductor, which effectively addresses the challenge of generating conducting gestures from music. The use of SMPL parameters for detailed pose representation is a significant advancement over traditional sparse keypoint methods, allowing for a more nuanced and expressive depiction of conducting motions. The dual encoder-decoder architecture, comprising a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder, is well-conceived, leveraging the strengths of self-attention mechanisms to capture long-range dependencies in both music and gesture. The introduction of a retrieval-based evaluation model further enhances the methodology by providing a more meaningful assessment of the artistic correspondence between music and gestures, which is often overlooked in traditional metrics.
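The autoregressive decoding loop, given per-frame acoustic descriptors and an initial pose, can be sketched as follows. The tiny MLP decoder is only a placeholder for the Trans-Temporal Conducting Gesture Decoder, and the 72-dimensional axis-angle vector is an assumed stand-in for the SMPL pose parameters.

```python
import torch
import torch.nn as nn

class TinyGestureDecoder(nn.Module):
    """Toy stand-in for the gesture decoder: maps (audio feature, previous pose)
    to the next pose parameter vector."""
    def __init__(self, audio_dim=128, pose_dim=72, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, audio_feat, prev_pose):
        return self.net(torch.cat([audio_feat, prev_pose], dim=-1))

def rollout(decoder, audio_feats, init_pose):
    """Autoregressively predict a pose sequence from per-frame audio features.

    audio_feats: (batch, time, audio_dim), init_pose: (batch, pose_dim)
    """
    poses, prev = [], init_pose
    for t in range(audio_feats.shape[1]):
        prev = decoder(audio_feats[:, t], prev)
        poses.append(prev)
    return torch.stack(poses, dim=1)  # (batch, time, pose_dim)
```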
The experimental evaluation is robust, comparing the proposed model against established baselines in dance and conducting generation. The reported metrics (FID, M-Dist, MM-Dist, and diversity) indicate significant improvements in the quality and alignment of generated gestures with the corresponding music. The ablation studies convincingly demonstrate the contributions of the Transformer architecture and the alignment loss, supporting the claims of enhanced performance. The diversity in the dataset, covering various conducting styles and musical emotions, strengthens the validity of the results and showcases the model's adaptability.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or dataset access, which are crucial for reproducibility. The absence of a demo or project URL further limits the ability of other researchers to validate and build upon this work.
The paper acknowledges certain limitations, including the reliance on monocular reconstruction, which may not capture all nuances of conducting gestures, particularly baton motion and finger articulation. Additionally, the model struggles with very large gestures in energetic music and may lag during fast transitions. These limitations suggest areas for future research, such as incorporating hand-aware reconstruction techniques and exploring longer musical contexts.
The implications of this work extend beyond academic interest; it has potential applications in music education, virtual performances, and intelligent tutoring systems. By automating the generation of conducting gestures, this research could enhance interactive music learning environments and provide valuable tools for musicians and educators. The framework could also inspire further exploration of cross-modal motion synthesis in other artistic domains, promoting a deeper understanding of the interplay between music and movement.
Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.
Primary: South China University of Technology
All Institutions: South China University of Technology
MindMelody presents a novel approach to EEG-driven personalized music intervention, demonstrating a sophisticated integration of machine learning techniques that enhance the adaptability and effectiveness of music therapy. The paper's contributions to the field of affective computing and music generation are substantial, offering a promising direction for future research and applications in mental health.
The methodology presented in MindMelody is innovative, integrating a hybrid Transformer-GNN architecture for EEG decoding with a Retrieval-Augmented Generation (RAG) mechanism to formulate structured intervention plans. The use of a Hierarchical EEG Controller to modulate a pretrained music generation backbone is particularly noteworthy, as it allows for fine-grained control over the music output based on real-time EEG data. The closed-loop feedback mechanism that continuously adapts to user feedback enhances the system's responsiveness and personalization, which is a significant advancement over static music generation systems.
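A minimal sketch of the closed feedback loop is given below, assuming a decoder that returns Valence-Arousal estimates in [-1, 1] and a conditional music generator callable. The affect-to-control mapping is purely illustrative and is not the paper's learned Hierarchical EEG Controller.

```python
def controls_from_affect(valence: float, arousal: float) -> dict:
    """Map a Valence-Arousal estimate (both assumed in [-1, 1]) to coarse,
    purely illustrative generation controls."""
    a = max(-1.0, min(1.0, arousal))
    v = max(-1.0, min(1.0, valence))
    return {
        "tempo_bpm": 90.0 + 40.0 * a,      # higher arousal -> faster
        "brightness": 0.5 * (v + 1.0),     # higher valence -> brighter timbre
    }

def closed_loop(decode_eeg, generate_segment, eeg_stream, window_s=10):
    """Re-estimate affect from each incoming EEG window and condition the next
    music segment on the updated controls (the feedback loop in miniature)."""
    for eeg_window in eeg_stream:
        valence, arousal = decode_eeg(eeg_window)              # hypothetical decoder
        controls = controls_from_affect(valence, arousal)
        yield generate_segment(controls, duration_s=window_s)  # hypothetical generator
```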
The experiments conducted are robust, utilizing established datasets like DEAP for EEG affect modeling and MusicCaps for controllable music generation. The paper provides comprehensive quantitative metrics, including FAD and various subjective evaluations (Nat.-MOS, Emo.-MOS, Help.), which demonstrate the system's effectiveness in emotional alignment and perceived helpfulness. The pilot user study adds valuable qualitative insights into user experience, although it is limited in scope.
The paper includes detailed descriptions of the experimental setup, including hyperparameters and training procedures, which aids in reproducibility. However, the lack of publicly available code or a demo limits the ability for others to replicate the findings fully.
One limitation is the reliance on a relatively small dataset for training, which may affect the generalizability of the model across diverse populations. Additionally, while the pilot study shows promising results, it is not a clinical validation, and further research is needed to establish long-term efficacy and safety in real-world applications.
The potential applications of MindMelody are significant, particularly in mental health interventions, where personalized music therapy could provide non-invasive and cost-effective support for individuals experiencing emotional distress. The integration of EEG data with music generation could pave the way for more adaptive therapeutic tools in the field of affective computing.
Speech technologies are deployed in high-stakes settings, yet fairness concerns remain fragmented across tasks and disciplines. Existing surveys either adopt a general machine-learning perspective that overlooks speech-specific properties or focus on a single task, missing failure patterns shared across the speech domain. Synthesizing over 400 studies spanning generation and perception tasks and emerging speech-language models, this survey presents a unified framework that links formal fairness definitions to evaluation, diagnosis, and mitigation. We formalize seven fairness definitions adapted to the speech modality and organize the field's conceptual evolution through three paradigms: Robustness, Representation, and Governance. We then ground evaluation metrics in the mathematical cores of these definitions and offer a decision tree for metric selection. We diagnose bias sources along the speech processing pipeline, surfacing speech-specific mechanisms such as channel bias as a demographic proxy and annotation subjectivity in emotion labels. We systematize mitigation strategies across four intervention stages, mapping each to the diagnosed sources. Finally, we identify open challenges and propose directions for future research.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California, NTU Artificial Intelligence Center of Research Excellence
This paper serves as a foundational survey that systematically addresses bias and fairness in speech AI, providing a comprehensive framework that can guide future research and development in this critical area. The authors' approach to synthesizing existing literature and formalizing fairness definitions is a significant contribution to the field, setting the stage for more equitable speech technologies.
The paper presents a comprehensive survey that synthesizes over 400 studies related to bias and fairness in speech AI, establishing a unified framework that links formal fairness definitions to evaluation metrics, bias diagnosis, and mitigation strategies. The authors formalize seven fairness definitions specifically adapted to the speech modality and provide a decision tree for metric selection, which is a novel contribution to the field. The methodology is robust, drawing on a wide range of literature and systematically addressing the unique challenges posed by the speech domain.
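As a small example of the performance-parity style of evaluation the survey formalizes, the snippet below computes per-group means of a task metric (here word error rate) and the max-min gap across groups. The record format and numbers are hypothetical and serve only to illustrate how such a metric is grounded in group-wise aggregation.

```python
from collections import defaultdict

def group_metric_gap(records, metric_key="wer", group_key="group"):
    """Per-group means of a task metric and the max-min gap across groups,
    a simple instance of performance-parity evaluation."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        sums[r[group_key]] += r[metric_key]
        counts[r[group_key]] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    gap = max(means.values()) - min(means.values())
    return means, gap

# Hypothetical per-utterance records:
records = [
    {"group": "A", "wer": 0.12}, {"group": "A", "wer": 0.10},
    {"group": "B", "wer": 0.20}, {"group": "B", "wer": 0.18},
]
print(group_metric_gap(records))  # roughly ({'A': 0.11, 'B': 0.19}, 0.08)
```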
While the paper is primarily a survey and does not include original experimental results, it effectively reviews existing literature and identifies gaps in current methodologies. It categorizes bias sources along the speech processing pipeline and systematizes mitigation strategies, which could serve as a foundation for future empirical studies. The depth of analysis into bias mechanisms and fairness paradigms is commendable, although the lack of original experimental validation limits the immediate applicability of the findings.
The survey does not present original experiments, thus reproducibility in the traditional sense does not apply. However, the clear organization of existing literature and the proposed frameworks allow for future researchers to build upon this work in a reproducible manner. The decision tree for metric selection is particularly useful for guiding future empirical studies.
One limitation of the paper is its reliance on existing literature without presenting new empirical data or case studies to validate the proposed frameworks. Additionally, while the survey covers a wide range of topics, it may not address all nuances of bias and fairness in speech technologies, particularly in emerging areas of research. The authors also acknowledge the complexity of navigating fairness in sociotechnical contexts, which may not be fully captured in their framework.
The implications of this work are significant, as it addresses critical issues of bias and fairness in speech technologies that are increasingly deployed in high-stakes environments. By highlighting the need for fairness as a core requirement rather than an afterthought, the paper encourages researchers and practitioners to consider the ethical implications of their technologies. This survey could influence future research directions and policy-making in the field of AI and speech technology.
Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.
Primary: Texas State University
All Institutions: Texas State University
The paper presents MCM-AVQA, a confidence-aware audio-visual quality assessment framework that improves the robustness of quality evaluation under asymmetric distortions. This work significantly advances the state of the art in AVQA by integrating modality-specific confidence into the fusion process, leading to more accurate and interpretable quality assessments.
The proposed MCM-AVQA framework introduces a novel approach to audio-visual quality assessment by explicitly modeling modality-specific confidence and integrating it into a dedicated Audio-Visual Mixer. This methodology allows for dynamic feature gating based on confidence levels, which is a significant advancement over traditional methods that treat audio and video as equally reliable. The use of a multi-head visual confidence estimator and an audio confidence module enhances the robustness of the model under asymmetric distortions, which is a common scenario in real-world applications. The architecture is well-structured, leveraging state-of-the-art transformer models and attention mechanisms, making it a strong contribution to the field.
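A simplified view of confidence-aware fusion, assuming frame-level confidence scores for each modality, is a convex combination whose weights follow the normalized confidences, so the more reliable stream dominates. The paper's Audio-Visual Mixer uses confidence-guided channel attention, so the sketch below captures only the gating intuition, with placeholder shapes.

```python
import torch

def confidence_gated_fusion(audio_feat, video_feat, audio_conf, video_conf):
    """Weight per-frame features by normalized modality confidence before fusion.

    audio_feat, video_feat: (batch, time, dim)
    audio_conf, video_conf: (batch, time) confidences in [0, 1]
    """
    weights = torch.softmax(torch.stack([audio_conf, video_conf], dim=-1), dim=-1)
    return weights[..., 0:1] * audio_feat + weights[..., 1:2] * video_feat

# Toy asymmetric-distortion case: clean audio, heavily degraded video.
a = torch.randn(2, 16, 128)
v = torch.randn(2, 16, 128)
a_conf = torch.full((2, 16), 0.9)
v_conf = torch.full((2, 16), 0.2)
fused = confidence_gated_fusion(a, v, a_conf, v_conf)  # dominated by audio
```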
The experiments conducted across multiple AVQA benchmarks (LIVE-SJTU, UnB-AV, UnB-AVQ) demonstrate the effectiveness of MCM-AVQA in improving correlation with human mean opinion scores. The results indicate that the model outperforms existing state-of-the-art methods, particularly in scenarios with asymmetric distortions. The ablation studies provide valuable insights into the contributions of each component of the model, reinforcing the importance of confidence-aware fusion. The use of statistical tests to validate performance improvements adds rigor to the evaluation.
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the results directly. Including a project URL or demo would significantly enhance reproducibility.
One limitation of the study is the lack of a comprehensive comparison with more recent AVQA methods that may not have been included in the evaluation. Additionally, while the model shows robustness under asymmetric distortions, its performance in extreme distortion scenarios or with novel types of distortions remains untested. The reliance on subjective mean opinion scores for evaluation, while standard, could also introduce variability based on human judgment.
The MCM-AVQA framework has significant implications for real-world applications in streaming, teleconferencing, and immersive media, where audio-visual quality is critical. By improving the accuracy of quality assessments in asymmetric distortion scenarios, this work can enhance user experiences in various multimedia applications. The approach could also be extended to other multimodal quality assessment tasks, potentially influencing future research directions in the field.
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling introduce high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to 8.5x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
Primary: Amazon AGI
All Institutions: Amazon AGI, National Taiwan University
The paper presents a significant advancement in efficient generative media by introducing a one-step sampling framework that achieves substantially faster inference while maintaining high audio fidelity and semantic relevance. The innovative combination of energy-distance training and representation distillation represents a meaningful contribution to the field of machine learning, particularly in audio generation.
The proposed methodology introduces a novel one-step sampling framework for text-to-audio generation that integrates an energy-distance training objective with representation-level distillation. This approach effectively reduces inference latency while maintaining audio quality, addressing a significant limitation in existing autoregressive models that rely on multi-step sampling. The use of energy-scoring to map Gaussian noise directly to audio latents is innovative and demonstrates a clear departure from traditional diffusion-based methods. The incorporation of distillation from a masked autoregressive model further enhances the model's performance, showcasing a thoughtful combination of techniques to achieve rapid and high-quality audio synthesis.
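The energy-distance objective can be estimated from samples. Assuming two independent one-step generations per conditioning input, the generator-relevant part of the standard estimator looks like the sketch below; this is a generic sample-based energy-score loss, not the paper's exact energy-scoring head.

```python
import torch

def energy_distance_loss(gen_a, gen_b, target):
    """Sample-based energy-distance objective.

    gen_a, gen_b: two independent one-step generations for the same condition
                  (different noise draws), shape (batch, dim)
    target:       reference latents, shape (batch, dim)

    Estimates 2 * E||g - t|| - E||g - g'||; the E||t - t'|| term is constant
    with respect to the generator parameters and is dropped.
    """
    d_gt = torch.norm(gen_a - target, dim=-1) + torch.norm(gen_b - target, dim=-1)
    d_gg = torch.norm(gen_a - gen_b, dim=-1)
    return (d_gt - d_gg).mean()
```

The repulsive E||g - g'|| term discourages the one-step generator from collapsing to a single output per condition, which is why two noise draws per condition are used.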
The experimental evaluation is comprehensive, utilizing the AudioCaps benchmark for both objective and subjective assessments. The paper reports consistent improvements over existing one-step baselines, with significant gains in fidelity and semantic relevance as measured by various metrics (FD, FAD, KL, IS, CLAP). The results demonstrate not only superior performance compared to prior models but also a substantial reduction in inference time, achieving up to 8.5 times faster batch inference than the state-of-the-art AR diffusion system, IMPACT. The thoroughness of the experiments, including ablation studies on representation distillation and classifier-free guidance, adds credibility to the findings.
The paper provides detailed descriptions of the experimental setup, including datasets, model configurations, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability of other researchers to replicate the results directly. Clear documentation of hyperparameters and training procedures is essential for future work in this area.
While the proposed method shows promising results, it still falls short of the audio quality achieved by multi-step diffusion models, indicating that there may be inherent trade-offs between speed and fidelity. The reliance on a single sampling step may also limit the model's flexibility in generating more complex audio sequences. Additionally, the paper does not address potential biases in the training datasets, which could affect the generalizability of the model.
The advancements in low-latency text-to-audio generation have significant implications for real-time applications in multimedia content creation, interactive media, and personalized audio experiences. The ability to generate high-quality audio quickly opens up new avenues for user engagement and creative expression. Furthermore, the integration of energy-distance training and representation distillation could inspire future research in other generative tasks across different modalities.
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuChoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
Primary: Fudan University
All Institutions: Fudan University, ByteDance
The main contribution of this paper is the introduction of GaMMA, a large multimodal model that effectively integrates temporal and non-temporal music understanding, alongside the establishment of MusicBench as a comprehensive evaluation benchmark. This work represents a significant advancement in the field of music AI, addressing critical gaps in existing models and providing a robust framework for future research.
The methodology presented in GaMMA is robust, utilizing a dual-encoder architecture that effectively captures both temporal and non-temporal aspects of music understanding. The mixture-of-experts approach, combined with a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning), is innovative and addresses existing gaps in music LMMs. The introduction of MusicBench as a comprehensive benchmark for evaluating music understanding adds significant value to the methodology, allowing for a nuanced assessment of model capabilities.
The experiments conducted demonstrate the effectiveness of GaMMA, achieving state-of-the-art results on multiple benchmarks, including MusicBench and MuChoMusic. The extensive evaluation across various dimensions of music understanding, including temporal reasoning and global attributes, showcases the model's capabilities. The use of human-curated questions in MusicBench enhances the credibility of the results, though the paper could benefit from more extensive comparisons with a wider range of existing models.
The paper provides detailed implementation specifics, including training strategies, hyperparameters, and data curation processes, which are essential for reproducibility. However, the absence of publicly available code or datasets limits the ability for independent verification of results.
One limitation is the reliance on curated datasets, which may introduce biases or limit the generalizability of the model. Additionally, while the dual-encoder approach is innovative, it may require significant computational resources, which could hinder accessibility for broader research applications.
GaMMA has the potential to significantly impact the field of music understanding and multimodal AI by providing a framework that can be adapted for various applications, such as music recommendation systems, educational tools, and interactive music assistants. Its ability to understand and reason about music in a nuanced manner could lead to advancements in how machines interact with human creativity and cultural expressions.
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents LASE, a novel approach to cross-script identity preservation in multilingual voice cloning, demonstrating significant advancements in disentangling language from speaker identity and providing valuable resources for future research. The methodology and results contribute meaningfully to the field of audio processing and speaker recognition, particularly in the context of Indic languages.
The paper introduces a novel approach using a Language-Adversarial Speaker Encoder (LASE) that effectively disentangles language from speaker identity in multilingual voice cloning tasks. The methodology employs a gradient-reversal layer and a supervised contrastive loss to create a speaker embedding that is invariant to language, which is a significant advancement in the field. The architecture is well-defined, consisting of a frozen WavLM-base-plus backbone and a trainable projection head, which allows for efficient training and effective performance on cross-script tasks.
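The gradient-reversal construction is standard and easy to sketch: an identity mapping in the forward pass that flips (and scales) gradients on the way back, placed between the speaker embedding and the language classifier. The dimensions and module names below are assumptions for illustration, not the released r1 architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class LanguageAdversarialHead(nn.Module):
    """Projection head plus a language classifier behind a gradient-reversal layer."""
    def __init__(self, in_dim=768, emb_dim=256, n_langs=4, lam=1.0):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(),
                                  nn.Linear(emb_dim, emb_dim))
        self.lang_clf = nn.Linear(emb_dim, n_langs)
        self.lam = lam

    def forward(self, backbone_feat):
        emb = self.proj(backbone_feat)                          # speaker embedding
        lang_logits = self.lang_clf(GradReverse.apply(emb, self.lam))
        return emb, lang_logits

# Training would combine a supervised contrastive loss on `emb` (pulling
# same-voice pairs together) with cross-entropy on `lang_logits`; the reversal
# layer pushes the embedding to become less language-predictive as the
# classifier improves.
```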
The experiments are robust, utilizing two distinct corpora to evaluate the performance of LASE against established baselines (WavLM-base-plus-sv and ECAPA-TDNN). The results demonstrate a significant reduction in the identity gap across scripts, with LASE achieving a gap of 0.013 compared to 0.082 and 0.105 for the baselines. The paper also includes a thorough analysis of the training dynamics and presents a synthetic multi-speaker diarisation benchmark, showing that LASE can match ECAPA-TDNN's performance with significantly less training data.
The authors provide a comprehensive set of resources, including the model weights, training corpus, and evaluation scripts, which enhances reproducibility. The detailed description of the training process, loss functions, and hyperparameters further supports the ability of other researchers to replicate the results.
The study relies solely on synthetic data generated by ElevenLabs, which may not fully capture the complexities of natural human speech. Additionally, the held-out set shares voices with the training data, limiting the generalization assessment. The paper also acknowledges that the model's performance on real-world data and new voices remains to be evaluated.
The implications of this work are significant for applications in multilingual voice cloning, speaker verification, and diarisation systems, particularly in contexts involving Indian languages. The ability to maintain speaker identity across different scripts can enhance user experience in customer support, content creation, and accessibility technologies.
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach is to apply standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework that uses a foundational audio generation model as its backbone and jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables a more interpretable and practical video-to-audio synthesis.
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The paper presents MMAudio-LABEL, a novel framework for joint audio generation and event labeling from silent videos, demonstrating significant improvements over existing methods. The technical contributions and methodology are well-articulated, showcasing the potential for broader applications in multimedia content creation and multimodal learning.
The proposed MMAudio-LABEL framework innovatively combines audio generation with event labeling in a unified architecture, addressing the limitations of traditional post-hoc sound event detection methods. By leveraging a multimodal transformer and exploring two distinct architectures (Parallel Heads and Joint Heads), the authors demonstrate a thoughtful approach to integrating visual and auditory information. The methodology is well-structured, with clear explanations of the model architecture and training objectives, although further details on the training data preprocessing and augmentation strategies could enhance clarity.
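One way to picture the joint-head design is a shared latent feeding an audio-prediction head plus frame-level onset and material heads, trained with a weighted multi-task loss. The sketch below uses an MSE proxy for the generative term and hypothetical loss weights and dimensions; it is not the authors' training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointHeads(nn.Module):
    """Shared latent -> (audio latent prediction, per-frame onset and material logits).
    A simplified stand-in for the joint generation / event-labeling heads."""
    def __init__(self, hidden=512, audio_latent=64, n_classes=17):
        super().__init__()
        self.audio_head = nn.Linear(hidden, audio_latent)
        self.onset_head = nn.Linear(hidden, 1)             # per-frame onset logit
        self.material_head = nn.Linear(hidden, n_classes)  # per-frame material class

    def forward(self, h):                                   # h: (batch, time, hidden)
        return self.audio_head(h), self.onset_head(h).squeeze(-1), self.material_head(h)

def joint_loss(outputs, audio_target, onset_target, material_target, w=(1.0, 0.5, 0.5)):
    """Weighted sum of a generative proxy loss and the two event-labeling losses."""
    audio_pred, onset_logit, material_logit = outputs
    l_audio = F.mse_loss(audio_pred, audio_target)          # proxy for the generative loss
    l_onset = F.binary_cross_entropy_with_logits(onset_logit, onset_target)
    l_mat = F.cross_entropy(material_logit.transpose(1, 2), material_target)
    return w[0] * l_audio + w[1] * l_onset + w[2] * l_mat
```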
The experiments are robust, utilizing the Greatest Hits dataset to evaluate both onset detection and material classification. The reported improvements in accuracy metrics (from 46.7% to 75.0% for onset detection and from 40.6% to 61.0% for material classification) provide compelling evidence of the framework's effectiveness. However, the paper could benefit from additional comparative analyses against a wider range of baseline models to contextualize the performance gains further.
The implementation details are adequately described, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the trained models or code would significantly enhance the paper's impact and usability for the research community.
One notable limitation is the reliance on a specific dataset (Greatest Hits), which may not fully represent the diversity of audio events in real-world scenarios. Additionally, the model's performance on less distinctive materials indicates potential challenges in generalization. The paper could also discuss the computational complexity and resource requirements of the proposed framework.
The MMAudio-LABEL framework has significant implications for content creation, immersive media, and human-computer interaction, as it enables more intuitive sound event labeling from silent videos. This could streamline workflows in various industries, including film production and gaming, where accurate audio representation is crucial. The integration of audio generation and event labeling also opens avenues for future research in multimodal learning and generative models.
We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from the domain expertise required. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question-answers, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model like Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.
Primary: University of Maryland, College Park, MD, USA
All Institutions: Centific Global Solutions Inc., University of Maryland, College Park, MD, USA
The paper presents MedMosaic, a large-scale medical audio question-answering benchmark designed to evaluate audio reasoning models under realistic clinical constraints. This work is significant as it addresses a critical gap in the evaluation of multimodal reasoning in the medical domain, providing a structured framework for future research and development in audio understanding and reasoning.
The methodology presented in this paper is robust, featuring a comprehensive pipeline for generating question-answer pairs from diverse medical audio sources. The authors effectively address the challenges of collecting and annotating medical audio data by leveraging synthetic audio generation techniques. The structured approach to creating varied question types (e.g., sound-only, speech-only, multi-turn) is commendable, as it allows for a nuanced evaluation of audio reasoning capabilities. The use of subject matter experts for validation adds credibility to the dataset's clinical relevance. However, the reliance on synthetic data raises questions about the authenticity of the generated audio and its implications for real-world applications.
The experimental evaluation is thorough, benchmarking 13 different audio and multimodal reasoning models against the MedMosaic dataset. The results demonstrate significant performance challenges across all models, highlighting the dataset's difficulty and the need for further advancements in medical audio reasoning. The detailed breakdown of model performance across various question types provides valuable insights into the strengths and weaknesses of current systems. However, the paper could benefit from more extensive comparisons with existing benchmarks to contextualize the results further.
The paper provides a detailed description of the dataset generation process and the evaluation framework, which aids in reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the findings. The authors should consider releasing the dataset and the generation pipeline to enhance reproducibility and facilitate further research in this area.
The primary limitation of this work lies in the reliance on synthetic audio, which may not fully capture the complexities of real-world medical audio scenarios. Additionally, while the dataset is extensive, the performance of state-of-the-art models remains relatively low, indicating that the benchmark may still be too challenging for current systems. The authors acknowledge the need for further validation before clinical deployment, which is a critical consideration for any application in healthcare.
The development of MedMosaic has the potential to significantly advance the field of medical audio processing and reasoning. By providing a challenging benchmark, it encourages the development of more sophisticated models capable of understanding and reasoning over complex medical audio. This could ultimately lead to improved clinical decision-making and patient outcomes. However, the authors emphasize the importance of extensive validation before any real-world application, highlighting the need for caution in deploying AI systems in healthcare settings.
While current federated multimodal continual learning over mixture-of-experts low-rank adaptation (MoE-LoRA) is built on the unverified assumption that routing isolates task-specific knowledge into disjoint experts, we argue that routing operates per-sample, while forgetting accumulates across the task sequence, and gradient conflict persists within each expert even when routing is maximally polarized. Moreover, activation-subspace protection can also fail because, under parameter-efficient fine-tuning, it entangles tasks due to a dimension-counting bound, and federated averaging (FedAvg) disrupts client-side orthogonality. To address this, we propose PRISM (Per-expert Routing-projection Interference-informed Subspace Method), which maintains a per-expert gradient subspace basis whose orthogonality is preserved under FedAvg and reinterprets MoE routing as a capacity allocator. Our results show that, on LLaVA-1.5-7B, LLaVA-1.5-13B, and Qwen2.5-VL-7B across CoIN-6 and CoIN-Long-10, PRISM outperforms sixteen state-of-the-art baselines in average accuracy. Compared to the best federated multimodal baseline, the performance margin increases from +3.23 pp on CoIN-6 to +6.06 pp on CoIN-Long-10.
Primary: South Dakota State University
All Institutions: South Dakota State University
The main contribution of this paper is the introduction of PRISM, a novel approach that effectively resolves issues of spurious isolation in federated multimodal continual learning by maintaining orthogonality in gradient subspaces and reinterpreting routing mechanisms. The comprehensive analysis of the methodology, experimental results, and potential applications underscores its significance in advancing the field of federated learning.
The paper introduces PRISM, a novel method addressing the limitations of existing federated multimodal continual learning approaches. The methodology is well-structured, focusing on the preservation of orthogonality in gradient subspaces and reinterpreting MoE routing as a capacity allocator. The proposed mechanisms, including the Per-Expert Federated Orthogonal Subspace Union (PE-FOSU) and interference-informed scheduling, are innovative and effectively tackle the identified issues of spurious isolation and entangled activation subspaces. The authors provide a clear theoretical foundation for their approach, which is crucial for understanding the underlying principles of their method.
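The core subspace mechanics can be sketched independently of the federated setting: project each expert's flattened gradient onto the orthogonal complement of a protected basis, and extend the basis with a QR step so orthogonality is preserved as new directions are added. This is a simplified stand-in for the per-expert subspace maintenance (PE-FOSU), with hypothetical function names, not the paper's full algorithm.

```python
import torch

def project_out_subspace(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of a flattened expert gradient that lies in the
    protected subspace spanned by the orthonormal columns of `basis`.

    grad:  (d,) flattened gradient of one expert's LoRA parameters
    basis: (d, k) orthonormal basis accumulated from previous tasks
    """
    if basis.numel() == 0:
        return grad
    return grad - basis @ (basis.T @ grad)

def extend_basis(basis: torch.Tensor, new_dirs: torch.Tensor, tol: float = 1e-6) -> torch.Tensor:
    """Add new directions and re-orthonormalize with QR so orthogonality is kept."""
    stacked = torch.cat([basis, new_dirs], dim=1) if basis.numel() else new_dirs
    q, r = torch.linalg.qr(stacked, mode="reduced")
    keep = r.diagonal().abs() > tol      # drop near-dependent directions
    return q[:, keep]
```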
The experimental setup is robust, evaluating PRISM against sixteen state-of-the-art baselines across two multimodal benchmarks (CoIN-6 and CoIN-Long-10). The results demonstrate significant improvements in average accuracy and backward transfer, with detailed comparisons that highlight the advantages of the proposed method. The paper includes comprehensive analyses of the results, showcasing the effectiveness of PRISM in various scenarios.
The paper provides sufficient implementation details, including the architecture, training protocols, and evaluation metrics. However, the absence of a public code repository or demo URL limits the reproducibility of the results. Future work should consider making the code available to facilitate validation by the research community.
While the proposed method shows promise, the paper does not address the computational overhead associated with maintaining per-expert gradient subspaces, which could be a concern in large-scale applications. Additionally, the evaluation is limited to specific multimodal benchmarks, and further testing on diverse datasets would strengthen the findings.
The implications of this research extend to various applications in federated learning, particularly in scenarios where data privacy is paramount. By enhancing the performance of multimodal continual learning systems, PRISM could contribute to advancements in areas such as personalized AI, healthcare, and collaborative learning environments.
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic intent, but it remains underexplored because dance requires simultaneous reasoning over linguistic semantics, musical rhythm, and full-body motion dynamics. We introduce TD-Data, a large-scale open dataset for text-dance retrieval, containing about 4,000 12-second dance clips, 14.6 hours of motion, 22 genres, and annotations from professional dance experts. On top of this dataset, we propose CustomDancer, a multimodal retrieval framework that aligns text with dance through a CLIP-based text encoder, music and motion encoders, and a music-motion blending module. CustomDancer achieves state-of-the-art performance on TD-Data, reaching 10.23% Recall@1 and improving retrieval quality in both quantitative benchmarks and user preference studies.
Primary: South-Central Minzu University
All Institutions: South-Central Minzu University
The main contribution of this paper is the introduction of CustomDancer, a multimodal framework for text-dance retrieval, and the TD-Data dataset, which together advance the state-of-the-art in dance content discovery. The comprehensive methodology, rigorous experimental evaluation, and acknowledgment of limitations underscore the significance of this work in the intersection of machine learning and the performing arts.
The methodology is robust, introducing a novel multimodal retrieval framework (CustomDancer) that effectively combines text, music, and motion through a well-structured architecture. The use of a CLIP-based text encoder alongside dedicated music and motion encoders is innovative, allowing for a more nuanced understanding of dance retrieval. The music-motion blending module is particularly noteworthy as it captures the interaction between music and motion, which is crucial for dance. The construction of the TD-Data dataset with expert annotations adds significant value, providing a solid foundation for training and evaluation.
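A minimal sketch of the retrieval objective, assuming batched text and dance embeddings: blend music and motion features (a crude convex-combination stand-in for the learned music-motion blending module) and train with a symmetric InfoNCE loss in the CLIP style. Names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def blend(music_emb, motion_emb, alpha=0.5):
    """Convex blend of music and motion embeddings (a simplified stand-in for
    the learned music-motion blending module)."""
    return F.normalize(alpha * music_emb + (1 - alpha) * motion_emb, dim=-1)

def clip_style_retrieval_loss(text_emb, dance_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched text/dance pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    dance_emb = F.normalize(dance_emb, dim=-1)
    logits = text_emb @ dance_emb.T / temperature
    targets = torch.arange(len(text_emb), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```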
The experiments are comprehensive, utilizing multiple evaluation metrics (Recall@K, Median Rank, Mean Rank) that are appropriate for the task. The comparison with strong baselines demonstrates the effectiveness of CustomDancer, and the user study adds a qualitative dimension to the evaluation, confirming that the model aligns well with human judgments. The ablation studies provide insights into the contributions of different components of the model, reinforcing the importance of temporal modeling and feature fusion.
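For reference, the evaluation metrics above are straightforward to compute from a text-to-dance similarity matrix; the sketch below is a generic illustration (not the authors' evaluation code), with embedding sizes and the diagonal ground-truth convention assumed for simplicity.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K, Median Rank, and Mean Rank from a text-to-dance
    similarity matrix `sim` of shape (num_queries, num_gallery), where the
    ground-truth match for query i is assumed to be gallery item i."""
    order = np.argsort(-sim, axis=1)                 # gallery sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1       # 1-based rank of the correct clip
    metrics = {f"R@{k}": float(np.mean(ranks <= k) * 100) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["MeanR"] = float(np.mean(ranks))
    return metrics

# Toy usage with random embeddings standing in for text and dance features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 256))
dance_emb = text_emb + 0.5 * rng.normal(size=(100, 256))   # noisy positives
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
dance_emb /= np.linalg.norm(dance_emb, axis=1, keepdims=True)
print(retrieval_metrics(text_emb @ dance_emb.T))
```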
The paper provides detailed implementation details, including the architecture of the encoders and the training objectives. However, the lack of a publicly available code repository or dataset could hinder reproducibility. Future work should consider releasing the code and dataset to facilitate further research in this area.
The paper acknowledges several limitations, including challenges with specialized terminology, conflicts between visual motion and musical affect, and potential performer bias. These factors can impact retrieval accuracy and user satisfaction. Additionally, the dataset's focus on 3D motion and music may overlook important visual elements like costumes and facial expressions.
The work has the potential to significantly impact the fields of dance education, choreography, and creative recommendation systems. By making dance retrieval more accessible, it can facilitate learning and exploration of diverse dance styles. However, the authors emphasize the need for cultural sensitivity in dataset construction and application, highlighting the importance of preserving the context and community significance of dance styles.
To address the limitations of existing Generative Fixed-Filter Active Noise Control (GFANC) methods, which rely on filter decomposition and recombination and require supervised learning with labeled data, this paper proposes a Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework. Unlike previous approaches that predict combination weights of sub control filters, the proposed method directly generates control filters in an unsupervised manner by integrating the co-processor and real-time controller into a fully differentiable ANC system, where the accumulated error signal is used as the training objective. By abandoning the decomposition--reconstruction process, the proposed design simplifies the control pipeline and avoids error accumulation, while the Transformer architecture effectively captures global and dynamic noise characteristics through its attention mechanism. Numerical simulations on real-recorded noises demonstrate that the proposed method achieves improved noise reduction performance and adaptability to different types of noises compared with the original GFANC framework.
Primary: unknown
All Institutions: unknown
The paper presents a novel Transformer-based framework for active noise control that simplifies the filter generation process and improves adaptability to real-world noise conditions. This work is significant as it combines advanced neural architectures with practical applications in noise cancellation, potentially leading to enhanced performance in diverse acoustic environments.
The proposed Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework represents a significant methodological advancement in active noise control (ANC) by integrating a Transformer architecture for direct control-filter generation. This approach eliminates the need for sub-filter decomposition and recombination, which simplifies the control pipeline and enhances adaptability to varying noise conditions. The unsupervised training paradigm, which relies on minimizing the accumulated residual error, is innovative as it reduces the dependency on labeled data, a common limitation in many machine learning applications. The use of a differentiable ANC system allows for end-to-end training, which is a notable strength of the methodology.
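As a rough illustration of what such an end-to-end, label-free objective could look like, the sketch below filters the reference noise with a generated control filter and a fixed secondary-path impulse response and minimizes the residual error energy; the placeholder linear generator, signal lengths, and filter taps are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

def anc_residual_loss(ref, primary, sec_path, control_filter):
    """Hypothetical differentiable ANC objective: the reference noise is filtered
    by the generated control filter, passed through the fixed secondary path, and
    subtracted from the primary-path noise; the loss is the accumulated energy of
    the residual error signal, so no filter labels are needed."""
    def fir(x, h):                               # causal FIR filtering via conv1d
        w = h.flip(-1).view(1, 1, -1)
        y = F.conv1d(x, w, padding=h.numel() - 1)
        return y[..., : x.shape[-1]]
    anti_noise = fir(fir(ref, control_filter), sec_path)
    error = primary - anti_noise                 # residual at the error microphone
    return (error ** 2).mean()

# Toy usage: a placeholder linear layer stands in for the Transformer co-processor
# and maps the reference signal to a 256-tap control filter.
T, L = 16000, 256
ref = torch.randn(1, 1, T)
primary = torch.randn(1, 1, T)
sec_path = torch.randn(L)                        # assumed-known secondary-path impulse response
gen = torch.nn.Linear(T, L)
w = gen(ref.view(1, -1)).view(L)
loss = anc_residual_loss(ref, primary, sec_path, w)
loss.backward()                                  # gradients flow back to the generator
```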
The experimental setup is robust, utilizing a large synthetic dataset of 83,977 noise samples and evaluating the model's performance on both unseen real-world and synthetic noises. The results indicate that the proposed method outperforms the existing GFANC framework in most real-noise scenarios, demonstrating its practical applicability. However, the performance on synthetic noises is mixed, suggesting that while the model excels in real-world conditions, it may not universally outperform all existing methods across all noise types. The evaluation metrics used, particularly the noise reduction (NR) levels, are appropriate for assessing ANC performance.
The paper provides sufficient detail regarding the model architecture, training parameters, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. Future work could benefit from sharing the implementation details and datasets used for training and testing.
One significant limitation is the reliance on a fixed acoustic path during training and evaluation, which may not generalize well to different acoustic environments without retraining the model. Additionally, the increased complexity of the Transformer-based model, while beneficial for performance, raises concerns about computational efficiency and resource requirements, which could limit its deployment in real-time applications.
The proposed framework has the potential to significantly improve active noise control systems in various applications, including consumer electronics, automotive, and industrial environments. By enhancing adaptability to dynamic noise conditions, this research could lead to more effective noise cancellation solutions, improving user experience and comfort in noisy environments. The implications for real-time processing and deployment in practical scenarios are promising, although further work is needed to address the identified limitations.
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on 5 different tasks with 56 benchmark datasets, and note that Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
Primary: Reality Defender Inc.
All Institutions: Reality Defender Inc., INRS
The main contribution of this paper is the introduction of Alethia, a foundational encoder for voice deepfakes that significantly enhances detection and localization capabilities through an innovative pretraining methodology. This work addresses critical gaps in existing models and sets a new standard for future research in the domain of audio deepfake detection.
The paper introduces a novel pretraining framework for voice deepfake detection, Alethia, which innovatively combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. This dual-branch approach allows the model to learn robust representations that capture generative artifacts in voice deepfakes, addressing limitations in existing speech foundation models (SFMs) that primarily focus on downstream finetuning. The methodology is well-structured, with a clear explanation of the model architecture, pretraining objectives, and the rationale behind the design choices, such as the use of continuous embeddings instead of discrete tokens.
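The contrast between discrete-token and continuous-embedding targets can be made concrete with a small sketch; the loss below regresses continuous target embeddings on masked frames with a cosine objective, which is one plausible instantiation of a bottleneck masked embedding prediction branch rather than the paper's exact formulation (tensor shapes and the 30% mask rate are illustrative).

```python
import torch
import torch.nn.functional as F

def masked_embedding_prediction_loss(student_out, teacher_emb, mask):
    """Continuous masked-prediction sketch: on masked frames, the encoder's
    outputs regress continuous target embeddings rather than discrete token IDs;
    cosine distance is one common choice of regression loss."""
    pred = student_out[mask]                     # (num_masked, dim) predictions at masked positions
    target = teacher_emb[mask].detach()          # continuous targets, no quantization
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

# Toy usage with random tensors standing in for encoder outputs and targets.
B, T, D = 4, 200, 768
student_out = torch.randn(B, T, D, requires_grad=True)
teacher_emb = torch.randn(B, T, D)
mask = torch.rand(B, T) < 0.3                    # roughly 30% of frames masked
loss = masked_embedding_prediction_loss(student_out, teacher_emb, mask)
loss.backward()
```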
The experimental evaluation is comprehensive, covering five different tasks across 56 benchmark datasets, which is a significant contribution to the field. The results demonstrate that Alethia outperforms existing SFMs in various metrics, including equal error rate (EER) and accuracy, particularly in challenging scenarios. The zero-shot generalization capability to unseen domains, such as singing deepfakes, is a notable strength of the model. However, the paper could benefit from more detailed ablation studies to further validate the contributions of each component in the proposed framework.
The paper provides a thorough description of the experimental setup, including data preprocessing, model architecture, and training procedures. However, the lack of publicly available code or datasets limits reproducibility. Providing a GitHub repository or links to the datasets used would enhance the ability of other researchers to replicate the findings.
One limitation of the study is the reliance on self-curated datasets for pretraining, which may introduce biases or artifacts not present in real-world data. Additionally, while the model shows promising results, its performance on edge cases or highly diverse datasets remains to be fully explored. The paper also does not address potential ethical implications of deepfake technology, which is crucial given the sensitive nature of the application.
The research has significant implications for the field of audio processing and deepfake detection, contributing to the development of more robust systems that can help mitigate the risks associated with the misuse of deepfake technology. As deepfakes become more prevalent, the ability to detect and localize them effectively is crucial for maintaining trust in digital communications.
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, National Center for Supercomputing Applications
The main contribution of this paper is the development of a few-shot accent synthesis pipeline that leverages LLM-guided phoneme editing to improve ASR performance in low-resource settings. This innovative approach not only addresses the challenge of accent adaptation but also demonstrates the effectiveness of combining TTS and ASR technologies to enhance speech recognition across diverse accents.
The proposed methodology effectively combines few-shot learning with LLM-guided phoneme editing to address the challenge of accent adaptation in ASR systems. The approach is innovative in its use of a phoneme-conditioned TTS model and the integration of LLMs for phoneme editing, which allows for accent-specific pronunciation adjustments while maintaining prosodic alignment. The system's architecture is well-defined, and the use of a matched-rate random phoneme baseline provides a strong comparative framework to evaluate the effectiveness of the LLM-guided edits.
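The matched-rate random baseline is conceptually simple: perturb the same fraction of phonemes as the LLM-guided edits, but choose replacements uniformly at random. The sketch below is an illustrative implementation under that assumption; the phoneme inventory and edit rate are stand-ins, not values from the paper.

```python
import random

def random_phoneme_edit(phonemes, edit_rate, inventory, seed=0):
    """Matched-rate random baseline sketch: substitute a fraction `edit_rate` of
    phonemes with uniformly sampled alternatives, so only the amount of
    phoneme-space perturbation matches the LLM-guided edits, not their structure."""
    rng = random.Random(seed)
    out = list(phonemes)
    n_edits = max(1, round(edit_rate * len(out)))
    for i in rng.sample(range(len(out)), n_edits):
        out[i] = rng.choice([p for p in inventory if p != out[i]])
    return out

# Toy usage on an ARPAbet-like sequence (inventory and rate are illustrative).
inventory = ["AA", "AE", "IH", "IY", "T", "D", "TH", "DH", "S", "Z", "R", "W"]
print(random_phoneme_edit(["DH", "IH", "S", "IH", "Z", "AA", "R", "T"], 0.25, inventory))
```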
The experiments are comprehensive, evaluating the proposed method across multiple accents (Indian and Korean English) and demonstrating significant improvements in WER through synthetic data generation. The paper provides a clear experimental setup, including detailed descriptions of the datasets, evaluation metrics, and results. The findings indicate that the proposed method not only enhances ASR performance in low-resource scenarios but also shows potential for cross-speaker generalization, which is a critical aspect of practical ASR applications.
The paper includes sufficient implementation details, including training configurations, feature extraction methods, and evaluation protocols, which support reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their code and models to enhance reproducibility.
One notable limitation is that the system inherits prosody from the source speech rather than modeling accent-specific prosodic variations, which may restrict the fidelity of the synthesized speech. Additionally, the adaptation is limited to a single reference speaker, which could affect the generalizability of the results across different speakers and accents. Future work should address these limitations by exploring multi-speaker accent generation and explicit prosody modeling.
The research has significant implications for improving ASR systems in diverse linguistic contexts, particularly for underrepresented accents. By enabling effective accent adaptation with minimal data, this work can contribute to more inclusive speech technologies that better serve global populations. The potential applications extend to various domains, including voice assistants, transcription services, and accessibility tools, enhancing communication for speakers of different accents.
Broad exploration of robocall surveillance research is hindered by limited access to public datasets, owing to privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance research. Robo-SAr comprises ~200 unwanted and ~1200 legitimate synthetic robocall samples across three realistic adversarial axes: psycholinguistics-manipulated transcripts, emotion-eliciting speech, and cloned voices. We further propose RoboKA, a Kolmogorov-Arnold Network (KAN)-based multimodal fusion framework designed to model structured nonlinear interactions between acoustic and linguistic cues that characterize diverse adversarial robocall strategies. RoboKA first leverages cross-modal contrastive learning to align latent modality representations and feeds the resulting embeddings to a KAN-projection head for final classification. We benchmark RoboKA against strong unimodal and multimodal baselines in both in-domain and out-of-domain setups, finding RoboKA to surpass all baselines in terms of recall and F1-score.
Primary: Indraprastha Institute of Information Technology Delhi
All Institutions: Indraprastha Institute of Information Technology Delhi, George Mason University
The main contribution of this paper is the introduction of Robo-SAr, a novel adversarial dataset for robocall surveillance, and the development of RoboKA, a KAN-informed multimodal framework that significantly improves the detection of unwanted calls. This work addresses critical gaps in the field by providing a comprehensive approach to modeling the complex interactions between audio and linguistic cues, thereby advancing the state of the art in robocall detection.
The methodology presented in this paper is robust and innovative, leveraging a novel dataset (Robo-SAr) that addresses the limitations of existing datasets in robocall research. The use of Kolmogorov-Arnold Networks (KAN) for multimodal fusion is a significant advancement, as it allows for the modeling of complex nonlinear interactions between audio and text modalities. The cross-modal contrastive learning approach enhances the alignment of representations, which is crucial for effective robocall detection. The authors also provide a clear explanation of their methods and the rationale behind their choices, making the methodology both sound and well-justified.
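The cross-modal alignment step can be illustrated with a standard symmetric InfoNCE objective between audio and transcript embeddings; this is a generic sketch of that family of losses, not RoboKA's exact implementation, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE sketch for aligning audio and transcript embeddings of
    the same call: matching pairs sit on the diagonal of the similarity matrix."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Toy usage with random embeddings standing in for the two modality encoders.
loss = cross_modal_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```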
The experimental evaluation is comprehensive, benchmarking RoboKA against various unimodal and multimodal baselines under different conditions, including in-domain and out-of-domain setups. The results demonstrate a clear performance advantage for RoboKA, particularly in challenging scenarios, which underscores the effectiveness of the proposed approach. The use of human validation for the dataset adds credibility to the findings, although the paper could benefit from more detailed statistical analysis of the results.
The paper commits to releasing the dataset and code upon review, which is a positive step towards ensuring reproducibility. However, the lack of explicit URLs for accessing the dataset and code is a drawback. The methodology is described in sufficient detail to allow for replication, but the absence of a demo or project URL limits immediate accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English language robocalls, which restricts the applicability of the findings to multilingual contexts. Additionally, the reliance on synthetic data raises questions about the generalizability of the results to real-world scenarios. The authors also note that the dataset may not fully capture the complexities of real-world robocalls, which could impact the robustness of the model in practical applications.
The implications of this research are significant, particularly in the context of increasing robocall threats. By providing a robust framework for detecting deceptive robocalls, this work has the potential to enhance consumer protection and inform regulatory efforts. The methodology could also be adapted for other domains where multimodal deception detection is relevant, such as phishing or online scams.
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2% of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10% of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
Primary: University of Stellenbosch
All Institutions: University of Stellenbosch
The paper presents a pioneering evaluation of elephant call classification using pretrained acoustic embeddings, achieving significant performance without fine-tuning. This work not only advances the field of bioacoustics but also sets a precedent for leveraging existing models in low-data scenarios, thereby enhancing conservation efforts through automated analysis of wildlife vocalizations.
The paper introduces a novel approach to elephant call classification using pretrained acoustic embeddings without fine-tuning, which is significant given the scarcity of annotated bioacoustic data. The methodology is well-structured, employing a variety of embedding models from different domains and evaluating their performance with lightweight classifiers. The choice to analyze intermediate layers of transformer models for their efficacy in classification is particularly innovative, providing insights into the model's internal representations. The segmentation and classification processes are clearly defined, ensuring a robust experimental design.
The experiments are comprehensive, utilizing two distinct datasets for evaluation, which enhances the validity of the results. The performance metrics, including AUC and MAP, are appropriate for the classification task and allow for a nuanced understanding of model effectiveness. The results demonstrate that the best-performing embedding model, Perch 2.0, achieves competitive performance compared to end-to-end supervised models, highlighting the potential of using out-of-domain embeddings in low-resource settings.
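The evaluation protocol, frozen embeddings plus a lightweight classifier scored with cross-validated AUC, maps onto a few lines of standard tooling; in the sketch below, random features stand in for the pretrained embeddings, and the dimensionality and fold count are illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Frozen embeddings: X would come from a fixed pretrained encoder (e.g. one layer
# of Perch or wav2vec 2.0); here random features stand in for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1280))                 # (clips, embedding_dim)
y = rng.integers(0, 2, size=500)                 # call / no-call labels

clf = LogisticRegression(max_iter=1000)          # lightweight downstream classifier
aucs = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```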
The paper provides sufficient detail regarding the experimental setup, including data segmentation, model configurations, and hyperparameter tuning, which supports reproducibility. However, the lack of publicly available code or datasets limits the ease with which other researchers can replicate the study.
One notable limitation is the reliance on pretrained models that may not be strictly out-of-species, particularly with Perch 2.0, which raises questions about the generalizability of the findings. Additionally, the paper does not address potential biases in the datasets or the implications of using embeddings from models trained on other species.
The implications of this research extend beyond elephant call classification, as it demonstrates the utility of pretrained embeddings in bioacoustics, potentially influencing conservation strategies and wildlife management. The approach could be adapted for other endangered species, promoting the use of machine learning in ecological research and conservation efforts.
Audio-based stuttering systems to date have been trained for detection -- what disfluency is present now -- leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second clips) to predict whether the next contiguous clip contains any disfluency. (1) Severity-selective precursor signal: on the episode-grouped test set, aggregate preblock AUC is modest (0.581 [0.542, 0.619]), but stratifying by upcoming event type reveals concentration on clinically severe events -- blocks 0.601 [0.554, 0.651] and sound repetitions 0.617 [0.567, 0.667] both exclude chance, while fillers (0.45) and word repetitions (0.49) are at chance. The aggregate objective converges to a severity-selective predictor because severe events carry prosodic precursors; fillers do not. (2) Cross-population transfer: without fine-tuning, the same checkpoint applied to 1,024 pediatric Children-Who-Stutter utterances (FluencyBank Teaching) attains AUC 0.674 detection and 0.655 prediction; DisfluencySpeech and LibriStutter reach 0.58-0.60 AUC. (3) Deployable on-device: lossless export to CoreML (1.19 MB), ONNX (40 KB), TFLite. Neural-Engine latency per 3 s window: 0.25 ms (iPhone 17 Pro Max, A19 Pro) to 0.55 ms (iPhone SE 3rd-gen and M1 Max). A 4 Hz streaming simulation uses 0.54% of the real-time budget. Platt-calibrated outputs (test ECE 0.010, from 0.177 raw). Five negative ablations -- output-level Future-Guided Learning, multi-clip GRU, time-axis concatenation, asymmetric focal loss, direct block-targeted training -- none improved over the vanilla baseline.
Primary: Kozak Technologies Inc
All Institutions: Kozak Technologies Inc
The main contribution of this paper is the development of a predictive model for stuttering events using audio data, demonstrating that a relatively simple CNN can effectively identify clinically severe disfluencies based on prosodic precursors. This work not only advances the understanding of stuttering prediction but also paves the way for practical applications in speech therapy and real-time intervention systems.
The paper employs a convolutional neural network (CNN) architecture specifically designed for predicting stuttering events based on audio input. The methodology is robust, utilizing a well-defined dataset (SEP-28k) and employing a clear training objective that focuses on predicting upcoming disfluencies. The stratification of results by severity of disfluency types is a significant methodological strength, allowing for a nuanced understanding of the model's predictive capabilities. The inclusion of negative ablation studies further strengthens the methodology by demonstrating a thorough exploration of potential improvements that did not yield better results.
The experiments are well-structured, with a clear focus on both detection and prediction tasks. The use of multiple datasets, including cross-population transfer evaluations, enhances the credibility of the findings. The reported AUC scores provide a quantitative measure of performance, and the stratified analysis reveals important insights into the model's strengths and weaknesses. The deployment metrics, including on-device latency and model size, are particularly relevant for practical applications, showcasing the model's readiness for real-world use.
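The abstract's calibration figures (ECE dropping from 0.177 raw to 0.010 after Platt scaling) refer to a standard post-hoc procedure; the sketch below shows generic Platt scaling and an equal-width-bin ECE computation on random stand-in scores, not the paper's model outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE sketch: bin-weighted average of |accuracy - confidence| over equal-width bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        sel = (probs > lo) & (probs <= hi)
        if sel.any():
            ece += sel.mean() * abs(labels[sel].mean() - probs[sel].mean())
    return ece

# Platt scaling: fit a 1-D logistic regression from raw scores to labels on a
# held-out calibration split (scores and labels below are random stand-ins).
rng = np.random.default_rng(0)
raw_scores = rng.normal(size=(2000, 1))
labels = (raw_scores[:, 0] + rng.normal(scale=1.5, size=2000) > 0).astype(int)
platt = LogisticRegression().fit(raw_scores[:1000], labels[:1000])
calibrated = platt.predict_proba(raw_scores[1000:])[:, 1]
print("ECE after Platt scaling:", expected_calibration_error(calibrated, labels[1000:]))
```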
The paper emphasizes reproducibility by providing access to the training code, label-generation scripts, and the trained model weights. The detailed description of the training process, including hyperparameters and data preprocessing steps, further supports reproducibility. The inclusion of a catalog of negative results is a commendable practice that aids future research by preventing redundant efforts.
The paper acknowledges several limitations, including the single-clip context that may restrict the model's performance and the potential for variability across different speakers and datasets. The lack of fine-tuning on external datasets raises questions about the generalizability of the model's predictions. Additionally, the reliance on a coarse label for upcoming events could be improved with more precise annotations.
The research has significant implications for the field of speech therapy and assistive technologies for individuals who stutter. By enabling predictive capabilities in real-time, the model could facilitate closed-loop interventions that provide timely feedback to users. The deployment of such technology on consumer devices could enhance accessibility and usability for a broader audience, potentially improving the quality of life for many individuals.
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech, where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily-overlapped parallel conversations, followed by clustering the participants into conversational groups. In this work, we present the BUT system based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model. To cluster participants into conversation groups, we employ the Qwen3.5-122B LLM to estimate transcript topic similarity, followed by hierarchical agglomerative clustering. On the MCoRec development set, the proposed system achieves 33.7% WER and a clustering F1 score of 0.97, improving over the official baseline by 16.2% WER and 0.15 F1 absolute. On the eval set, our team ranked second, being 0.16% WER and 0.5% F1 worse than the best system.
Primary: Brno University of Technology
All Institutions: Brno University of Technology
This paper presents a novel approach to multi-talker ASR by integrating audio-visual cues and leveraging LLMs for clustering, achieving significant improvements over existing methods. The methodology is well-structured, and the results indicate a meaningful contribution to the field, although attention to limitations and reproducibility could enhance its impact further.
The proposed methodology integrates audio-visual cues into a long-context ASR system, leveraging pre-trained models (NVIDIA Parakeet-v2 and AV-HuBERT) effectively. The use of a gated mechanism for fusing audio and visual features is a notable innovation, allowing the model to dynamically adjust its reliance on each modality. The clustering approach, which employs a large language model (LLM) for semantic topic similarity, represents a significant departure from traditional heuristic methods. This combination of techniques is well-justified and demonstrates a thoughtful approach to addressing the challenges of multi-talker ASR.
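The clustering stage can be pictured as agglomerative clustering over a pairwise topic-similarity matrix; the sketch below uses SciPy's hierarchical clustering on a toy symmetric matrix, with the average linkage and distance threshold chosen for illustration rather than taken from the BUT system.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_speakers(topic_sim, threshold=0.5):
    """Group speakers into conversations from a pairwise topic-similarity matrix
    (random values stand in here for LLM-estimated similarities); agglomerative
    clustering runs on the corresponding distance matrix."""
    dist = 1.0 - topic_sim
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=threshold, criterion="distance")

rng = np.random.default_rng(0)
sim = rng.uniform(0.0, 1.0, size=(8, 8))
sim = (sim + sim.T) / 2                          # symmetrize the toy similarity matrix
print(cluster_speakers(sim))                     # conversation-group label per speaker
```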
The experimental setup is robust, with clear metrics for both transcription (WER) and clustering (F1 score). The authors provide a thorough analysis of their results, showing substantial improvements over the baseline. However, the reliance on synthetic data for training raises questions about the generalizability of the results to real-world scenarios. The evaluation on both development and eval sets, along with comparisons to baseline systems, adds credibility to their findings.
The paper includes sufficient implementation details, including the training regimen, data preprocessing, and the use of specific frameworks (NeMo and DSPy). The availability of the code on GitHub enhances reproducibility, although the authors could provide more detailed instructions for replicating the experiments.
One limitation is the potential domain mismatch between the synthetic training data and the real-world MCoRec dataset, which could affect the model's performance in practical applications. Additionally, while the clustering approach shows promise, its reliance on LLMs may introduce variability based on the model's performance and the quality of the transcripts.
The advancements in multi-talker ASR have significant implications for applications in various fields, including telecommunications, accessibility for the hearing impaired, and human-computer interaction. The integration of visual cues into ASR systems could lead to more robust and accurate transcription services, enhancing communication in noisy environments.
Cross-lingual speaker verification suffers from severe language-speaker entanglement. This causes systematic degradation in the hardest scenario: correctly accepting utterances from the same speaker across different languages while rejecting those from different speakers sharing the same language. Standard adversarial disentanglement degrades speaker discriminability; blind discriminators inadvertently penalize speaker-discriminative traits that merely correlate with language. To address this, we propose Dual-LoRA, injecting trainable task-factorized LoRA adapters into a frozen pre-trained backbone. Our core innovation is a Language-Anchored Adversary: by grounding the discriminator with an explicit language branch, adversarial gradients target true linguistic cues rather than arbitrary correlations, preserving essential speaker characteristics. Evaluated on the TidyVoice benchmark, our system achieves a 0.91% validation EER and places 3rd in the official challenge.
Primary: Nanjing University
All Institutions: Nanjing University, AISpeech Co, Jiangsu Key Lab of Language Computing, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Soul AI Lab
The paper presents Dual-LoRA, an innovative framework for cross-lingual speaker verification that effectively disentangles language and speaker identity, achieving notable performance improvements on benchmark evaluations. The comprehensive methodology and rigorous experimental validation contribute significantly to the field, addressing a critical challenge in speaker verification systems.
The methodology presented in the paper is innovative, particularly in its use of Dual-LoRA, which introduces a parameter-efficient approach to disentangle language and speaker identity in cross-lingual speaker verification. The architecture's design, which incorporates two parallel LoRA streams and a Language-Anchored Adversary, is well-justified and addresses key challenges in the field. The decision to keep the backbone frozen while adapting only the LoRA modules is a strategic choice that enhances the model's efficiency and effectiveness.
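For readers less familiar with LoRA, the sketch below shows a generic low-rank adapter wrapped around a frozen linear layer; the rank, scaling, and initialization are conventional defaults, and the dual task-factorized streams described in the paper would correspond to two such adapters applied in parallel (this is an illustration, not the authors' code).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA sketch: a frozen pretrained linear layer plus a trainable
    low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))                 # only A and B receive gradients
```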
The experiments conducted on the TidyVoice benchmark are robust, with a clear focus on evaluating the proposed framework against established baselines. The use of multiple backbones and the systematic analysis of different configurations provide strong evidence for the effectiveness of the Dual-LoRA approach. The reported results, including the significant reduction in EER, particularly in challenging scenarios, underscore the practical impact of the proposed method.
The paper provides sufficient implementation details, including the architecture, training procedures, and hyperparameters, which facilitate reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for others to replicate the results independently.
One notable limitation is the reliance on a single benchmark dataset (TidyVoice) for evaluation, which may not fully capture the generalizability of the proposed method across diverse real-world scenarios. Additionally, while the paper addresses the issue of language-speaker entanglement, it does not explore potential biases that may arise from the training data or the implications of using specific languages.
The proposed Dual-LoRA framework has the potential to significantly enhance cross-lingual speaker verification systems, making them more effective for applications in voice authentication and personalization across different languages. This advancement could lead to broader adoption of voice-based technologies in multilingual contexts, improving accessibility and user experience.
Conventional neural speech codecs suffer from severe intelligibility degradation at ultra-low bitrates, where the bottleneck transitions from acoustic distortion to semantic loss. To address this issue, this paper conducts a systematic investigation into the role and fundamental limits of integrating frozen semantic priors -- specifically HuBERT and Whisper -- into neural speech coding. We introduce and quantitatively validate a novel Semantic Retirement phenomenon: while semantic constraints reduce the Word Error Rate (WER) by up to ~10% relative at 1.5 kbps, their benefits rapidly diminish beyond 6 kbps, indicating a practical capacity boundary. We further uncover a clear trade-off between different prior types: acoustic-rich priors (HuBERT) better preserve prosodic and timbral details, whereas high-level linguistic priors (Whisper) effectively suppress phonetic hallucinations in noisy environments (reducing hallucination rates by 26%) and substantially narrow the generalization gap for unseen speakers. Building on these findings, we propose a bitrate-aware regulation strategy that dynamically adjusts prior strength to optimize the trade-off between semantic consistency and perceptual naturalness. Extensive experimental evaluations confirm that our approach achieves competitive intelligibility and noise robustness compared to existing baselines, offering a principled pathway toward ultra-low-bitrate generative speech coding.
Primary: Tsinghua Shenzhen International Graduate School, Tsinghua University
All Institutions: Tsinghua Shenzhen International Graduate School, Tsinghua University, Tencent
This paper presents a comprehensive analysis of the role of semantic priors in neural speech coding, introducing a novel framework that enhances intelligibility and robustness at ultra-low bitrates. The innovative methodology and thorough experimental evaluation contribute significantly to the field of audio processing, addressing a critical challenge in speech codec design.
The methodology presented in this paper is robust and well-structured. The authors propose a novel framework that integrates frozen semantic priors (HuBERT and Whisper) into a neural speech codec, addressing the challenges of intelligibility degradation at ultra-low bitrates. The introduction of the "Semantic Retirement" phenomenon is a significant contribution, as it quantitatively defines the limits of semantic guidance in speech coding. The bitrate-aware regulation strategy is particularly innovative, allowing the model to dynamically adjust the strength of semantic constraints based on the bitrate, which is a practical approach to optimize performance across varying conditions.
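One way to picture the bitrate-aware regulation is as a loss whose semantic-consistency term is weighted by a schedule that decays as the bitrate rises toward the observed ~6 kbps boundary; the sketch below encodes that intuition with a simple linear schedule and a cosine semantic term, which are assumptions for illustration and not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def bitrate_aware_weight(bitrate_kbps, low=1.5, high=6.0, max_weight=1.0):
    """One plausible regulation schedule (not the paper's exact formula): the
    semantic-prior weight is strongest at ultra-low bitrates and decays linearly
    to zero by roughly 6 kbps, where 'semantic retirement' is observed."""
    frac = (high - bitrate_kbps) / (high - low)
    return max_weight * float(min(max(frac, 0.0), 1.0))

def codec_loss(recon_spec, target_spec, codec_sem, prior_sem, bitrate_kbps):
    recon = F.l1_loss(recon_spec, target_spec)                          # acoustic term
    semantic = 1.0 - F.cosine_similarity(codec_sem, prior_sem, dim=-1).mean()
    return recon + bitrate_aware_weight(bitrate_kbps) * semantic

# Toy usage: at 1.5 kbps the semantic prior dominates; at 8 kbps it is switched off.
spec, target = torch.randn(2, 80, 100), torch.randn(2, 80, 100)
sem_a, sem_b = torch.randn(2, 100, 768), torch.randn(2, 100, 768)
print(codec_loss(spec, target, sem_a, sem_b, 1.5).item(),
      codec_loss(spec, target, sem_a, sem_b, 8.0).item())
```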
The experimental evaluation is extensive and well-executed, utilizing the LibriSpeech dataset to validate the proposed framework. The authors provide a thorough analysis of the performance metrics, including Word Error Rate (WER), Perceptual Evaluation of Speech Quality (PESQ), and robustness against noise. The results convincingly demonstrate the effectiveness of the proposed method in improving intelligibility and reducing hallucination rates, particularly in low-bitrate scenarios. The ablation studies further strengthen the findings by isolating the effects of different semantic priors and the regulation strategy.
The paper includes sufficient implementation details, such as the architecture of the neural codec, the configuration of the Residual Vector Quantization, and the training setup. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Providing access to the models and datasets used would enhance the ability of other researchers to replicate and build upon this work.
One limitation is the reliance on frozen semantic priors, which may not capture the full range of acoustic nuances needed for optimal performance in all scenarios. Additionally, the paper primarily focuses on two specific priors (HuBERT and Whisper), which may limit the generalizability of the findings to other types of semantic guidance. The authors also acknowledge the potential for over-smoothing at higher bitrates, which could affect the naturalness of the output.
The findings of this research have significant implications for the development of efficient speech coding systems, particularly in applications where bandwidth is severely limited, such as mobile communications and low-bitrate streaming services. The insights gained from the "Semantic Retirement" phenomenon could inform future research on codec design and the integration of semantic information into other audio processing tasks. The approach could also pave the way for advancements in speech synthesis and recognition systems that require high intelligibility in challenging acoustic environments.
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
Primary: Southern University of Science and Technology
All Institutions: Southern University of Science and Technology, Tencent Youtu Lab
The main contribution of this paper is the development of a robust framework for Audio Deepfake Detection that leverages hard sample classification and diffusion-based reconstruction to enhance generalization against unseen attacks. This work represents a meaningful advancement in the field of audio deepfake detection, addressing critical challenges posed by evolving generative models.
The paper proposes a novel framework for Audio Deepfake Detection (ADD) that emphasizes hard sample classification and utilizes diffusion-based reconstruction methods. The integration of multi-layer feature aggregation and the introduction of Regularization-Assisted Contrastive Learning (RACL) are significant contributions that enhance the model's generalization capabilities. The methodology is well-structured, with clear explanations of the reconstruction paradigms and loss functions employed. However, while the approach is innovative, it builds on existing concepts in contrastive learning and reconstruction, which slightly limits its novelty.
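Multi-layer feature aggregation is commonly implemented as a learnable weighted sum over an SSL encoder's hidden layers; the sketch below shows that common pattern as a stand-in for the paper's aggregation module, with the layer count and shapes assumed for illustration.

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Multi-layer feature aggregation sketch: a learnable softmax-weighted sum
    over the hidden states of all encoder layers, a common way to expose
    intermediate SSL features to a downstream deepfake classifier."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):            # (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * hidden_states).sum(dim=0)

agg = LayerAggregator(num_layers=13)
feats = agg(torch.randn(13, 2, 200, 768))        # -> (2, 200, 768)
```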
The experiments are comprehensive, evaluating the proposed methods across multiple datasets, including ASVspoof and CodecFake. The results demonstrate a significant reduction in the average Equal Error Rate (EER) compared to baseline models, showcasing the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of different components of the methodology, reinforcing the validity of the findings. However, the paper could benefit from a more detailed analysis of potential edge cases or scenarios where the model may underperform.
The implementation details are sufficiently detailed, including data preprocessing, model architecture, and training parameters, which enhances reproducibility. However, the absence of a publicly available code repository or demo limits the ability for other researchers to replicate the results directly.
One limitation is the reliance on specific reconstruction methods, which may not generalize well across all types of audio deepfakes. Additionally, the performance on certain datasets showed minor degradation, suggesting that the model may prioritize generalization over specific artifacts. The paper could also discuss potential biases in the datasets used for training and evaluation.
The implications of this research are significant, particularly in the context of security and misinformation, as robust audio deepfake detection systems are crucial for maintaining trust in audio communications. The proposed framework could be applied in various domains, including cybersecurity, media verification, and social media platforms, where audio authenticity is paramount.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability reveals that the metric rewards acoustic mimicry over genuine emotional synthesis.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California
The paper critically examines the limitations of the emotion similarity metric EMO-SIM in evaluating emotional expressiveness in speech generation, revealing its misalignment with human perception and robustness issues. This comprehensive analysis challenges existing methodologies and underscores the need for improved evaluation frameworks in the field.
The paper employs a systematic approach to evaluate the limitations of the widely adopted EMO-SIM metric for emotional expressiveness in speech generation. It rigorously tests the metric against three criteria: categorical emotion robustness, dimensional emotion sensitivity, and human perception alignment. The methodology includes adversarial sampling, calibration of latent spaces, and a comprehensive evaluation against human judgments, which is a significant strength. However, the lack of a clear new metric or framework to replace EMO-SIM is a notable gap.
The experiments are well-designed, utilizing diverse datasets and multiple evaluation scenarios to assess the performance of EMO-SIM. The results consistently demonstrate the metric's inadequacy in capturing genuine emotional expressiveness, particularly under various acoustic and linguistic distractors. The statistical analyses, including Spearman's correlation and triplet accuracy, provide robust evidence of the findings. However, the paper could benefit from additional comparisons with existing metrics to contextualize its claims further.
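For context, the metric under scrutiny reduces to a cosine similarity between utterance-level emotion embeddings, and its alignment with listeners is typically checked with a rank correlation; the sketch below illustrates both steps on random stand-in data (embedding size, rating scale, and sample counts are assumptions).

```python
import numpy as np
from scipy.stats import spearmanr

def emo_sim(ref_emb, gen_emb):
    """EMO-SIM-style score sketch: cosine similarity between utterance-level
    emotion embeddings of a reference and a generated sample."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return float(ref @ gen)

# Alignment check sketch: correlate metric scores with human ratings (random
# stand-ins here); a low Spearman rho would reflect the misalignment reported.
rng = np.random.default_rng(0)
metric_scores = np.array([emo_sim(rng.normal(size=768), rng.normal(size=768)) for _ in range(50)])
human_ratings = rng.uniform(1, 5, size=50)
rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```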
The paper provides sufficient detail on the experimental setup, including dataset preparation and evaluation criteria, which aids reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the findings fully.
The primary limitation is the lack of a proposed alternative metric to EMO-SIM, which leaves a gap in practical applicability. Additionally, the focus on a single metric may overlook other potential evaluation frameworks that could be more effective. The experiments also rely heavily on subjective human evaluations, which may introduce variability.
This work has significant implications for the development of more reliable metrics in speech synthesis and emotional voice conversion, which are critical for applications in human-computer interaction, entertainment, and accessibility technologies. By highlighting the deficiencies of current evaluation methods, it encourages the community to pursue more accurate and meaningful metrics for emotional expressiveness in generated speech.