Audio ML Papers

Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning

Yuxin Kong, Peng Yang, Chongbin Yi ... · IEEE ICME 2026

Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizatio...

Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizations. It hence leads to unacceptable generation latency. In this paper, we propose an end-edge collaborative system design to accelerate multi-condition T2I generation through adaptive condition offloading and pruning. Extensive offline profiling reveal that, different conditions exhibit significant diversity in computation and communication costs. To this end, we propose a \textit{Subtask Manager} that jointly optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, balancing local and edge execution delays to minimize overall preprocessing latency. Then, we design a lightweight feature-driven \textit{Conditioning Scale Estimator} that evaluates the contribution of each condition by analyzing its feature activation strength and overlap with other conditions. This allows adaptive conditioning scale selection and pruning of insignificant conditions, thereby accelerating the denoising process. Extensive experimental results show that our system reduces latency by nearly 25\% and improves 6\% average generation quality, outperforming other benchmarks.

Institutional Affiliations

Primary: Huazhong University of Science and Technology

All Institutions: Huazhong University of Science and Technology, Central South University

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of an end-edge collaborative system that effectively accelerates multi-condition T2I generation through adaptive condition offloading and pruning. This work represents a meaningful advancement in the field, addressing critical challenges in computational efficiency and user control in AI-generated content.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel end-edge collaborative system design that addresses the computational and communication overhead associated with multi-condition text-to-image (T2I) generation. The proposed Subtask Manager optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, which is a significant improvement over existing methods. The Conditioning Scale Estimator further enhances the system by evaluating the contribution of each condition, allowing for adaptive pruning of insignificant conditions. This dual approach effectively reduces latency while maintaining image quality, showcasing a well-thought-out methodology that balances local and edge processing.

Experimental Evaluation

The experimental results are robust, demonstrating a 25% reduction in latency and a 6% improvement in average generation quality compared to existing benchmarks. The authors conduct extensive profiling and performance evaluations across various hardware setups, which strengthens the validity of their claims. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize the improvements more effectively.

Reproducibility

The paper provides a clear description of the experimental setup, including the hardware used and the specific configurations for the algorithms. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should consider making the implementation available to facilitate further research and validation.

Limitations

One limitation of the proposed system is its reliance on specific hardware configurations, which may not generalize to all user devices. Additionally, the heuristic nature of the optimization may not guarantee the absolute best performance in all scenarios, particularly in highly variable network conditions. The paper also does not address potential scalability issues when the number of users or conditions increases significantly.

Broader Impact

The proposed system has significant implications for real-time applications in AI-generated content, particularly in scenarios where user interaction and control are paramount. By reducing latency and improving generation quality, this work could enhance user experiences in creative industries, gaming, and virtual reality. The approach also opens avenues for further research in edge computing and collaborative AI systems. The main contribution of this paper is the introduction of an end-edge collaborative system that effectively accelerates multi-condition T2I generation through adaptive condition offloading and pruning. This work represents a meaningful advancement in the field, addressing critical challenges in computational efficiency and user control in AI-generated content.

Analysis: Full Paper • Full text: 26,714 characters

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Tao Yu, yiming ding, Shenghua Chai ... · arXiv

Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce \textbf{Omni-Deep...

Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce \textbf{Omni-DeepSearch}, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44\% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.

Institutional Affiliations

Primary: Chinese Academy of Sciences (CASIA)

All Institutions: Chinese Academy of Sciences (CASIA), University of Chinese Academy of Sciences (UCAS), Beijing Academy of Artificial Intelligence (BAAI), Peking University, Tsinghua University

ML Relevance Analysis (83)

The paper presents Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search, highlighting the challenges and limitations of current models while providing a structured methodology for future research in this underexplored area.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel benchmark, Omni-DeepSearch, which focuses on audio-driven omni-modal deep search, a largely unexplored area in multimodal learning. The methodology is well-structured, with a clear definition of the task and a multi-stage filtering pipeline that ensures the quality and relevance of the dataset. The authors emphasize audio dependence and multi-hop reasoning, which are critical for evaluating models that must infer and retrieve information across different modalities based solely on audio input. The task taxonomy and dataset construction are thorough, providing a solid foundation for future research.

Experimental Evaluation

The experiments conducted on various models, including both closed-source and open-source, reveal significant challenges in the task, with the best-performing model achieving only 43.44% accuracy. This highlights the complexity of audio-driven retrieval and reasoning, as well as the limitations of current models. The ablation studies and case analyses provide valuable insights into specific failure modes, such as dominant clue bias and misclassification, which are critical for understanding the limitations of existing approaches.

Reproducibility

While the paper provides a comprehensive description of the dataset construction and evaluation metrics, it lacks detailed implementation specifics that would facilitate reproducibility. The absence of a publicly available dataset or code repository further limits the ability of other researchers to replicate the results or build upon this work.

Limitations

The paper acknowledges several limitations, including the inherent ambiguity of audio signals and the reliance on external knowledge for retrieval. Additionally, the performance gap between closed-source and open-source models suggests that there is still much work to be done in improving model capabilities in this domain. The lack of a publicly available dataset or code also hinders broader adoption and experimentation.

Broader Impact

The introduction of Omni-DeepSearch has the potential to significantly impact the field of multimodal learning by providing a new benchmark that emphasizes audio as a primary modality for information retrieval. This could lead to advancements in various applications, including voice-activated assistants, audio-based search engines, and enhanced human-computer interaction systems. By addressing the challenges of audio-driven reasoning, this work opens up new avenues for research and development in multimodal AI. The paper presents Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search, highlighting the challenges and limitations of current models while providing a structured methodology for future research in this underexplored area.

Analysis: Full Paper • Full text: 50,026 characters

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

Zheng Wang, Xiaobin Rong, Hang Su ... · arXiv

Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-i...

Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/. The complete source code will be released after the manuscript is accepted.

Institutional Affiliations

Primary: Nanjing University

All Institutions: Nanjing University, MiLM Plus, Xiaomi Inc.

Demo

ML Relevance Analysis (83)

The paper presents L3-SE, a novel framework for reducing linguistic hallucination in LM-based speech enhancement through noise-invariant acoustic-semantic distillation, demonstrating significant improvements in linguistic consistency and perceptual quality under challenging conditions.

Comprehensive Analysis

Methodology Assessment

The proposed L3-SE framework introduces a novel approach to speech enhancement by utilizing a noise-invariant acoustic-semantic distillation strategy. This dual-target distillation method, which leverages both acoustic fidelity and semantic consistency, is innovative in addressing the issue of linguistic hallucination in generative speech models. The architecture effectively combines a shared backbone with task-specific heads, allowing for robust conditioning that enhances the model's performance under noisy conditions. The integration of a high-fidelity codec further supports the quality of the generated speech, making the methodology both comprehensive and well-structured.

Experimental Evaluation

The experiments are thorough, utilizing a variety of datasets and evaluation metrics that cover perceptual quality, linguistic consistency, and speaker preservation. The results demonstrate that L3-SE outperforms existing baselines, particularly in challenging conditions such as low-SNR and reverberation. The use of both objective and subjective metrics strengthens the evaluation, providing a well-rounded assessment of the framework's capabilities.

Reproducibility

The paper mentions that the complete source code will be released upon acceptance, which is a positive aspect for reproducibility. However, the implementation details are somewhat dense, and while they provide a comprehensive overview of the training process, clearer guidelines or supplementary materials could enhance reproducibility further.

Limitations

One limitation is the reliance on specific datasets for training and evaluation, which may affect generalizability to other speech enhancement scenarios. Additionally, while the framework shows improvements in linguistic consistency, the perceptual quality metrics could still be further optimized to match or exceed the best-performing models in all conditions.

Broader Impact

The proposed framework has significant implications for applications in speech recognition, communication technologies, and assistive devices, where clarity and accuracy in speech are crucial. By addressing linguistic hallucination effectively, it could enhance user experience in various real-world applications, making it a valuable contribution to the field of audio processing and machine learning. The paper presents L3-SE, a novel framework for reducing linguistic hallucination in LM-based speech enhancement through noise-invariant acoustic-semantic distillation, demonstrating significant improvements in linguistic consistency and perceptual quality under challenging conditions.

Analysis: Full Paper • Full text: 50,026 characters

Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila · SMC 2026 conference

Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it syste...

Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies bias source before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised, and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important

Institutional Affiliations

Primary: Wichita State University

All Institutions: Wichita State University, Institut national de la recherche scientifique (INRS-EMT), INRS-UQO Mixed Research Unit on Cybersecurity

ML Relevance Analysis (82)

This paper presents a pioneering diagnosis-first framework for addressing gender bias in audio deepfake detection systems, significantly advancing the understanding and mitigation of bias in machine learning applications. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field, highlighting the importance of systematic bias diagnosis before applying mitigation strategies.

Comprehensive Analysis

Methodology Assessment

The paper introduces a systematic diagnosis-first framework for identifying and mitigating gender bias in audio deepfake detection. This approach is innovative as it emphasizes understanding the sources of bias before applying mitigation strategies, which is a significant departure from existing methods that often apply fixes without thorough diagnosis. The methodology is well-structured, detailing a comprehensive evaluation of bias sources at data, model, and decision levels, and it introduces novel mitigation techniques such as EAFR, SGFS, and GNEA. Each method is clearly defined, and the rationale for their implementation is well-articulated.

Experimental Evaluation

The experimental setup is robust, utilizing the ASVSpoof5 dataset, which is appropriate for the study's focus on gender fairness in audio deepfake detection. The paper conducts extensive experiments across multiple models (AASIST and Wav2Vec2+ResNet18) and evaluates various mitigation strategies, providing a thorough analysis of their effectiveness. The results are presented clearly, with a focus on multiple fairness metrics, which enhances the credibility of the findings. However, the reliance on a single dataset may limit the generalizability of the results.

Reproducibility

The paper provides sufficient detail regarding the experimental setup, model architectures, and evaluation protocols, which supports reproducibility. However, the absence of publicly available code or a project repository limits the ability for others to reproduce the findings directly. Including a demo or project URL would enhance the reproducibility aspect significantly.

Limitations

The study is limited to a single dataset (ASVSpoof5) and focuses on binary gender labels, which may not capture the full spectrum of gender representation. Additionally, while the paper identifies multiple sources of bias, it acknowledges that no single method completely closes the fairness gap, indicating that further research is needed to address these issues comprehensively.

Broader Impact

The implications of this work are significant, particularly in high-stakes applications such as security and identity verification, where fairness and bias in detection systems can have profound societal impacts. By addressing gender bias in audio deepfake detection, the paper contributes to the broader discourse on fairness in AI systems, emphasizing the need for equitable treatment across demographic groups. This paper presents a pioneering diagnosis-first framework for addressing gender bias in audio deepfake detection systems, significantly advancing the understanding and mitigation of bias in machine learning applications. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field, highlighting the importance of systematic bias diagnosis before applying mitigation strategies.

Analysis: Full Paper • Full text: 26,356 characters

Do Joint Audio-Video Generation Models Understand Physics?

Zijun Cui, Xiulong Liu, Hao Fang ... · arXiv

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benc...

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.

Institutional Affiliations

Primary: University of Texas at Dallas

All Institutions: University of Texas at Dallas, University of Washington, University of California, Los Angeles

Demo · GitHub

ML Relevance Analysis (86)

The main contribution of this paper is the establishment of a comprehensive benchmark for evaluating physical commonsense in joint audio-video generation models, which addresses a critical gap in the current evaluation landscape. The innovative methodology and thorough experimental evaluation provide valuable insights into the limitations of existing models and pave the way for future advancements in the field.

Comprehensive Analysis

Methodology Assessment

The paper introduces AV-Phys Bench, a novel benchmark for evaluating physical commonsense in joint audio-video generation models. The methodology is robust, employing a structured evaluation rubric that assesses five dimensions of performance across three scene categories. The use of both human evaluation and an automated evaluator (AV-Phys Agent) that integrates multimodal reasoning with deterministic audio measurement tools is particularly innovative. This dual approach enhances the reliability of the assessments and provides a comprehensive framework for understanding model performance beyond mere perceptual quality.

Experimental Evaluation

The experiments conducted are thorough, evaluating seven models across various categories and dimensions. The results reveal significant gaps in physical commonsense understanding among current models, particularly in transition scenarios. The performance metrics are well-defined, and the analysis is detailed, providing insights into the strengths and weaknesses of the evaluated models. The findings are significant, highlighting the challenges in achieving physical consistency in audio-video generation.

Reproducibility

The paper provides sufficient details regarding the evaluation setup, including the datasets and scoring mechanisms. The availability of the dataset and the code repository enhances reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other models not included in the study.

Limitations

The paper acknowledges limitations such as the focus on English prompts and the binary nature of the evaluation rubric, which may not capture the nuances of model performance. Additionally, the study is constrained to eight-second clips, which may not represent longer or more complex scenarios effectively.

Broader Impact

The introduction of AV-Phys Bench has the potential to significantly influence the development of joint audio-video generation models by providing a clear framework for assessing physical commonsense. This could lead to improvements in model architectures and training methodologies, ultimately enhancing the applicability of these models in real-world scenarios, such as virtual environments and educational content. The main contribution of this paper is the establishment of a comprehensive benchmark for evaluating physical commonsense in joint audio-video generation models, which addresses a critical gap in the current evaluation landscape. The innovative methodology and thorough experimental evaluation provide valuable insights into the limitations of existing models and pave the way for future advancements in the field.

Analysis: Full Paper • Full text: 50,026 characters

A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

Qiqi He, Dichucheng Li, Xiaoheng Sun ... · ACM International Conference on Multimedia Retrieval (ICMR 2026)

Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility...

Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.

Institutional Affiliations

Primary: NetEase Cloud Music

All Institutions: NetEase Cloud Music, Individual Researcher

ML Relevance Analysis (83)

The paper introduces a novel Retrieval-Edit-Rerank framework for chord generation that effectively balances stylistic diversity and music-theoretic feasibility. This work is significant as it provides a structured approach to a complex creative task, advancing the field of music generation by offering a system that is both interpretable and adaptable.

Comprehensive Analysis

Methodology Assessment

The proposed Retrieval-Edit-Rerank (RER) framework effectively decomposes the chord generation task into three distinct stages, allowing for a clear separation of concerns that enhances both interpretability and control over the generation process. The methodology is well-structured, with a focus on leveraging a melody-chord memory for retrieval, followed by an editing stage that enforces music-theoretic constraints, and a reranking stage that resolves preferences among feasible candidates. This approach is innovative in the context of music generation, as it combines stylistic diversity with theoretical validity in a systematic manner. The use of a contrastive learning framework for memory construction is a notable strength, as it allows for the retrieval of stylistically relevant chord progressions without sacrificing harmonic integrity.

Experimental Evaluation

The experiments are comprehensive, utilizing multiple datasets and a variety of metrics for both objective and subjective evaluation. The inclusion of ablation studies strengthens the findings by demonstrating the importance of each stage in the RER framework. The results show that the proposed method outperforms existing end-to-end models in terms of balancing diversity and feasibility, which is a critical aspect of chord generation. The subjective evaluations involving human participants provide additional validation of the system's effectiveness, indicating a well-rounded experimental design.

Reproducibility

The paper provides a clear description of the methodology and experimental setup, which facilitates reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the results fully. Including a GitHub repository or links to datasets would significantly enhance the reproducibility of the work.

Limitations

One limitation is the reliance on a fixed set of music-theoretic constraints, which may not capture the full range of stylistic diversity present in various musical genres. Additionally, the system's performance may vary depending on the quality and diversity of the melody-chord memory constructed during training. The paper also notes that the editing stage can sometimes lead to overly conservative outputs, which may limit creative exploration.

Broader Impact

The RER framework has the potential to significantly impact music generation applications, particularly in contexts where adherence to music theory is essential, such as in music education, composition tools, and automated music production systems. By providing a controllable and interpretable approach to chord generation, this work could facilitate more nuanced interactions between musicians and AI systems, enhancing creativity while respecting musical traditions. The paper introduces a novel Retrieval-Edit-Rerank framework for chord generation that effectively balances stylistic diversity and music-theoretic feasibility. This work is significant as it provides a structured approach to a complex creative task, advancing the field of music generation by offering a system that is both interpretable and adaptable.

Analysis: Full Paper • Full text: 17,664 characters

Anisotropic Modality Align

Xiaomin Yu, Yijiang Li, Yuhui Zhang ... · arXiv

Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimoda...

Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.

Institutional Affiliations

Primary: Hong Kong University of Science and Technology (HKUST)

All Institutions: Hong Kong University of Science and Technology (HKUST), Stanford University

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of AnisoAlign, a structured geometric correction framework that addresses the modality gap in multimodal learning, allowing for effective training of multimodal models using unimodal data. This work significantly advances the understanding of modality alignment and provides a robust methodology that can be applied in various multimodal applications.

Comprehensive Analysis

Methodology Assessment

The proposed methodology, AnisoAlign, presents a novel approach to addressing the modality gap in multimodal learning by focusing on the geometric structure of modality representations. The authors effectively identify that the modality gap is not merely a centroid shift but an anisotropic residual, which is a significant insight. The method involves a two-stage process that includes a target-modality prior pretraining and a bounded refinement step to ensure that the source modality's semantic structure is preserved while aligning with the target modality. This structured approach is well-justified and theoretically supported, making it a strong contribution to the field.

Experimental Evaluation

The experiments are comprehensive, evaluating both geometric diagnostics and the performance of the model in multimodal large language model (MLLM) training. The results demonstrate that AnisoAlign outperforms existing methods in terms of both representation alignment and MLLM training effectiveness. The use of various metrics to assess performance adds rigor to the evaluation, although the paper could benefit from more extensive ablation studies to further clarify the contributions of individual components.

Reproducibility

The paper provides a detailed description of the methodology and experimental setup, which aids in reproducibility. However, the lack of a public code repository or demo limits the ability for external validation of results. Future work should consider making the implementation available to enhance reproducibility.

Limitations

One limitation is the reliance on high-quality unimodal data, which may not always be available. Additionally, while the paper discusses the geometric aspects of modality alignment, it does not fully explore the implications of varying data quality on the effectiveness of the proposed method.

Broader Impact

The findings have significant implications for the development of multimodal models, particularly in scenarios where paired data is scarce. By enabling the use of unimodal data for training, this work could facilitate advancements in applications such as image captioning, visual question answering, and other areas where multimodal understanding is crucial. The main contribution of this paper is the introduction of AnisoAlign, a structured geometric correction framework that addresses the modality gap in multimodal learning, allowing for effective training of multimodal models using unimodal data. This work significantly advances the understanding of modality alignment and provides a robust methodology that can be applied in various multimodal applications.

Analysis: Full Paper • Full text: 50,026 characters

BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

Hamze Hammami, Nidhal Abdulaziz · arXiv

Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeV...

Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.

Institutional Affiliations

Primary: Heriot-Watt University Dubai

All Institutions: Heriot-Watt University Dubai

ML Relevance Analysis (83)

The paper presents a significant advancement in unsupervised learning for bioacoustic state discovery, demonstrating the ability to extract structured acoustic patterns from honey bee buzzing without prior assumptions or annotations. The methodology is innovative and the results impactful, contributing to both machine learning and ecological monitoring fields.

Comprehensive Analysis

Methodology Assessment

The paper introduces BeeVe, a novel unsupervised framework for acoustic state discovery in honey bee buzzing, leveraging a self-supervised Patchout Spectrogram Transformer (PaSST) as a feature extractor and a Vector-Quantized Variational Autoencoder (VQ-VAE) for learning a discrete codebook of acoustic tokens. The methodology is well-structured, employing a rigorous unsupervised learning approach without relying on predefined labels or semantic assumptions. The use of post-hoc evaluation against known queen status to validate the learned tokens adds robustness to the methodology. However, the choice of PaSST as a frozen feature extractor, while justified, may limit the model's adaptability to other non-vocal species.

Experimental Evaluation

The experiments are comprehensive, utilizing a dataset of honey bee audio to assess the effectiveness of the proposed method. The results demonstrate significant separation between queenright and queenless conditions, with Jensen-Shannon Divergence values indicating meaningful distinctions. The identification of stable sub-states within the queenless condition and the analysis of token transition patterns provide strong evidence of the model's capability to uncover structured acoustic states. The metrics used for evaluation, including Jaccard overlap and manifold projection, are appropriate and effectively illustrate the model's performance.

Reproducibility

The paper provides detailed implementation details, including the architecture of the VQ-VAE, training objectives, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for other researchers to replicate the findings directly.

Limitations

The study is limited by its reliance on a controlled subset of the UrBAN dataset, which may not fully capture the diversity of acoustic states across different hives and conditions. Additionally, while the findings are promising, the lack of biological annotation to ground the discovered states raises questions about their true biological relevance. The scalability of the approach to larger datasets and more complex hive conditions remains to be validated.

Broader Impact

The implications of this work extend to non-invasive monitoring of honey bee colonies, potentially aiding in the early detection of conditions such as queen loss or swarming. The unsupervised nature of the framework allows for the identification of previously unlabelled states, which could enhance hive management practices and contribute to pollinator conservation efforts. The approach also opens avenues for future research in bioacoustics and machine learning applications in non-vocal species. The paper presents a significant advancement in unsupervised learning for bioacoustic state discovery, demonstrating the ability to extract structured acoustic patterns from honey bee buzzing without prior assumptions or annotations. The methodology is innovative and the results impactful, contributing to both machine learning and ecological monitoring fields.

Analysis: Full Paper • Full text: 41,049 characters

Evaluating voice anonymisation using similarity rank disclosure

Shilpa Chandra, Matteo Pettenò, Nicholas Evans ... · arXiv

The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Performance estimates dependent on the classifier and operating point provide an incomplete or even misleading characte...

The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Performance estimates dependent on the classifier and operating point provide an incomplete or even misleading characterisation of privacy risk. We investigate the use of similarity rank disclosure (SRD), an information-theoretic metric, which operates on feature representations rather than classifier decisions, providing a threshold-independent assessment of privacy and analysis of both average and worst-case disclosure. We report its application to speaker embeddings, fundamental frequency, and phone embeddings using 2024 VoicePrivacy Challenge systems. The SRD reveals privacy leaks and system-specific weaknesses missed by EER-based evaluation. Findings highlight the merit of representation-level metrics and demonstrate the potential of SRD as a flexible and interpretable tool for the evaluation of voice anonymisation.

Institutional Affiliations

Primary: EURECOM

All Institutions: EURECOM, Ruhr-Universität Bochum, Orange Innovation, University of Stuttgart

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of the Similarity Rank Disclosure (SRD) metric for evaluating voice anonymisation, which provides a more interpretable and comprehensive assessment of privacy risks compared to traditional metrics. The technical contribution is significant as it addresses critical gaps in existing evaluation practices, offering a robust framework for future research and application in voice privacy.

Comprehensive Analysis

Methodology Assessment

The paper introduces the Similarity Rank Disclosure (SRD) as a novel metric for evaluating voice anonymisation, which operates independently of classifier decisions and provides a more nuanced understanding of privacy risks. The methodology is well-structured, detailing the steps for computing SRD, including ranking, distribution generation, and statistical modeling. The use of empirical probability distributions and beta-binomial fitting enhances the robustness of the evaluation. However, the paper could benefit from clearer explanations of the statistical methods used and their implications for the results.

Experimental Evaluation

The experiments leverage a comprehensive dataset from the 2024 VoicePrivacy Challenge, applying the SRD to various anonymisation systems. The results demonstrate that SRD can reveal privacy leaks and weaknesses that traditional metrics like EER miss. The evaluation includes both qualitative and quantitative analyses, providing a thorough comparison of different anonymisation approaches. However, the paper does not provide extensive details on the experimental setup, such as the specific configurations of the anonymisation systems or the exact nature of the datasets used.

Reproducibility

The paper lacks sufficient details for full reproducibility. While it describes the methodology and provides some results, it does not include code or data availability, which are critical for other researchers to replicate the findings. Clearer documentation of the experimental setup and access to the datasets used would enhance reproducibility.

Limitations

One limitation is the reliance on a specific dataset (2024 VoicePrivacy Challenge) which may not generalize to other contexts or datasets. Additionally, the SRD's effectiveness in various real-world scenarios remains to be fully validated. The paper also acknowledges the potential for overestimation of privacy if strong attack models are not used, which is a critical consideration for future work.

Broader Impact

The findings have significant implications for the development of privacy-preserving technologies in voice processing, particularly in light of increasing concerns about data privacy and regulation. The SRD could serve as a foundational tool for evaluating voice anonymisation systems, influencing both academic research and industry practices. The flexibility of the SRD to adapt to various feature representations also opens avenues for future research in related domains. The main contribution of this paper is the introduction of the Similarity Rank Disclosure (SRD) metric for evaluating voice anonymisation, which provides a more interpretable and comprehensive assessment of privacy risks compared to traditional metrics. The technical contribution is significant as it addresses critical gaps in existing evaluation practices, offering a robust framework for future research and application in voice privacy.

Analysis: Full Paper • Full text: 31,421 characters

Online Segmented Beamforming via Dynamic Programming

Manan Mittal, Ryan M. Corey, Diego Cuji ... · arXiv

In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccess...

In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccessible in practice. Practical implementations overcome this by estimating the sample covariance matrix (SCM) through averaging over a block of temporal samples. However, in non-stationary settings, a naive batch approach fails. Moving interferers smear the SCM, causing the beamformer to place nulls in outdated locations while failing to track newly active interferers, thereby degrading its nulling capabilities. To address this fundamental limitation, an Online Segmented Beamformer is proposed. This algorithm incorporates data-driven temporal segmentation to causally minimize output power while dynamically adapting the SCM estimation windows to local stationarity. By framing the problem through the lens of dynamic programming, the proposed method tracks abrupt environmental changes and resets covariance estimates in real-time. We validate the performance of this framework in a complex, reverberant simulated acoustic environment and in highly reverberant real world experiments, demonstrating its superiority over fixed-window adaptive methods.

Institutional Affiliations

Primary: Stony Brook University

All Institutions: Stony Brook University, University of Illinois Chicago, University of Massachusetts Dartmouth

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of an Online Segmented Beamformer that dynamically adapts its covariance estimation windows to track changes in non-stationary acoustic environments, significantly enhancing the performance of adaptive beamforming techniques. The comprehensive analysis of the technical contribution, methodology, and significance to the field highlights its potential to advance audio processing applications in complex environments.

Comprehensive Analysis

Methodology Assessment

The proposed methodology introduces the Online Segmented Beamformer, which innovatively adapts the integration window for covariance matrix estimation based on the statistical characteristics of the incoming data. This dynamic programming approach allows for real-time tracking of environmental changes, addressing the limitations of traditional fixed-window methods. The algorithm's ability to maintain a balance between bias and variance through temporal segmentation is a significant advancement in adaptive beamforming techniques.

Experimental Evaluation

The experiments conducted in both simulated and real-world environments demonstrate the proposed method's effectiveness. The use of complex reverberant scenarios and dynamic sources provides a robust testing ground, and the reported performance metrics (SI-SDR, PESQ, STOI) indicate substantial improvements over fixed-window methods. The comprehensive evaluation across various conditions strengthens the validity of the results.

Reproducibility

While the paper provides a detailed description of the algorithm and its implementation, the lack of publicly available code or datasets limits reproducibility. Including a demo or project URL would enhance the ability for other researchers to validate and build upon this work.

Limitations

One limitation is the potential computational complexity associated with maintaining multiple candidate beamformers and the need for efficient real-time processing. Additionally, the algorithm's performance in highly dynamic environments with abrupt changes could be further explored to assess its robustness.

Broader Impact

The Online Segmented Beamformer has the potential to significantly impact various applications in audio processing, including speech enhancement, hearing aids, and sonar systems, where dynamic acoustic environments are prevalent. Its ability to adaptively manage interference in real-time could lead to advancements in communication technologies and improve user experiences in noisy settings. The main contribution of this paper is the introduction of an Online Segmented Beamformer that dynamically adapts its covariance estimation windows to track changes in non-stationary acoustic environments, significantly enhancing the performance of adaptive beamforming techniques. The comprehensive analysis of the technical contribution, methodology, and significance to the field highlights its potential to advance audio processing applications in complex environments.

Analysis: Full Paper • Full text: 18,002 characters

TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

Yassin Terraf, Youssef Iraqi · IEEE International Conference on Multimedia and Expo (ICME) 2026

Closed-Set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong perform...

Closed-Set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.

Institutional Affiliations

Primary: University Mohammed VI Polytechnic

All Institutions: University Mohammed VI Polytechnic, CID Development

GitHub

ML Relevance Analysis (82)

The paper presents TARNet, a novel multi-scale architecture for closed-set speaker identification that effectively models temporal dependencies, achieving state-of-the-art performance while maintaining computational efficiency. The comprehensive evaluation of the methodology, experimental results, and potential applications underscores its significance in the field of audio processing and speaker recognition.

Comprehensive Analysis

Methodology Assessment

The proposed TARNet architecture introduces a multi-scale temporal encoder that effectively captures speaker-specific characteristics across different temporal scales. The use of dilated convolutions allows for the modeling of temporal dependencies while preserving resolution, which is a significant improvement over traditional CNN architectures. The Attentive Statistics Pooling (ASP) module further enhances the model's ability to focus on discriminative features, making the methodology both innovative and practical for real-world applications.

Experimental Evaluation

The experiments conducted on VoxCeleb1 and LibriSpeech datasets demonstrate TARNet's superior performance compared to state-of-the-art models. The results are well-presented, showing a clear advantage in accuracy metrics. The paper also includes ablation studies that validate the importance of each component in the architecture, providing a comprehensive evaluation of the model's effectiveness.

Reproducibility

The authors have made the code publicly available, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and training procedures, to facilitate easier replication of results by other researchers.

Limitations

One limitation of the study is the lack of evaluation in noisy or reverberant conditions, which are common in real-world scenarios. Additionally, while TARNet shows strong performance on the evaluated datasets, its generalizability to other speaker identification tasks or languages remains untested.

Broader Impact

The advancements presented in TARNet have significant implications for biometric authentication and forensic analysis, where accurate speaker identification is crucial. The lightweight nature of the model also suggests potential applications in mobile and embedded systems, expanding its usability in various domains. The paper presents TARNet, a novel multi-scale architecture for closed-set speaker identification that effectively models temporal dependencies, achieving state-of-the-art performance while maintaining computational efficiency. The comprehensive evaluation of the methodology, experimental results, and potential applications underscores its significance in the field of audio processing and speaker recognition.

Analysis: Full Paper • Full text: 24,982 characters

Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces

Emma Coletta, Massimiliano Todisco, Michele Panariello ... · arXiv

We introduce Latent Secret Spin (LSS), a blind speech watermarking method based on geometric operations in codec latent space. Based upon orthogonal rotations to principal components, LSS induces imperceptible but detectable covariance signatures according to a pseudo-random wate...

We introduce Latent Secret Spin (LSS), a blind speech watermarking method based on geometric operations in codec latent space. Based upon orthogonal rotations to principal components, LSS induces imperceptible but detectable covariance signatures according to a pseudo-random watermarking schedule. The scheme generalises across datasets, preserves perceptual quality and, unlike some learned, neural watermarking schemes, it does not require neural network training, is resistant to common signal manipulations and is flexible to payload size. Analyses show that structured latent-space watermarking is a promising and interpretable alternative to existing approaches.

Institutional Affiliations

Primary: unknown

All Institutions: unknown

GitHub

ML Relevance Analysis (80)

The main contribution of this paper is the introduction of Latent Secret Spin (LSS), a novel blind speech watermarking method that utilizes geometric operations in latent spaces to achieve robust and imperceptible watermarking. The technical contributions are significant, providing a new perspective on watermarking methodologies that could influence future research and applications in audio content security.

Comprehensive Analysis

Methodology Assessment

The proposed Latent Secret Spin (LSS) methodology is innovative in its approach to blind speech watermarking by leveraging geometric operations in the latent space of neural audio codecs. The use of orthogonal rotations in principal component space to induce detectable covariance signatures is a novel contribution that distinguishes it from traditional watermarking techniques. The methodology is well-structured, with a clear explanation of the embedding and detection processes, including the use of a pseudo-random schedule for key management. However, while the geometric principles are sound, the reliance on PCA may limit the approach's adaptability to various audio contexts.

Experimental Evaluation

The experiments conducted to evaluate LSS are robust, utilizing two different speech datasets (VoxPopuli and ASVspoof) to assess both in-domain and out-of-domain performance. The evaluation metrics, including AUC-ROC for detection performance and PESQ for perceptual quality, provide a comprehensive view of the method's effectiveness. The results demonstrate strong detection capabilities across various audio manipulations, indicating good robustness. However, the paper could benefit from a more extensive range of experiments to explore the method's performance under more aggressive adversarial conditions.

Reproducibility

The paper includes a link to the source code and sample utterances, which facilitates reproducibility. The detailed description of the experimental setup, including the configuration of the encoder and decoder, as well as the parameters used for watermark embedding and detection, enhances the clarity of the methodology. However, the lack of subjective listening tests for perceptual quality assessment is a drawback that could affect reproducibility in practical applications.

Limitations

The study acknowledges some limitations, such as the focus on bona fide speech and the evaluation under common, non-malicious manipulations. The method's performance against stronger, adaptive attacks remains untested, and the reliance on objective metrics for perceptual quality could overlook important subjective aspects. Additionally, the distribution of watermarks at the chunk level may be vulnerable to temporal manipulations like splicing.

Broader Impact

The implications of LSS are significant, particularly in the context of increasing concerns around misinformation and content authenticity in the audio domain. By providing a robust and imperceptible watermarking solution, LSS could play a crucial role in verifying the provenance of speech recordings and enhancing the security of audio content. The method's flexibility and interpretability also suggest potential applications beyond speech, potentially extending to other forms of media where watermarking is essential. The main contribution of this paper is the introduction of Latent Secret Spin (LSS), a novel blind speech watermarking method that utilizes geometric operations in latent spaces to achieve robust and imperceptible watermarking. The technical contributions are significant, providing a new perspective on watermarking methodologies that could influence future research and applications in audio content security.

Analysis: Full Paper • Full text: 29,202 characters

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Xiaoming Ren, Ru Zhen, Chao Li ... · arXiv

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interact...

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

Institutional Affiliations

Primary: OPPO AI Center

All Institutions: OPPO AI Center

ML Relevance Analysis (84)

The main contribution of this paper is the introduction of X-OmniClaw, a unified mobile agent architecture that enhances multimodal understanding and interaction on Android devices. This work represents a significant step towards creating more intelligent and context-aware personal assistants, although it would benefit from more rigorous experimental validation and reproducibility measures.

Comprehensive Analysis

Methodology Assessment

The methodology of X-OmniClaw is robust, integrating multimodal perception, memory, and action into a cohesive framework designed for mobile agents. The use of a unified ingress pipeline for multimodal data and the incorporation of behavior cloning and trajectory replay for skill execution are significant advancements. The architecture is well thought out, allowing for real-time contextual awareness and personalized interactions, which is crucial for mobile applications. However, while the proposed methods are innovative, they build upon existing frameworks like OpenClaw and Hermes, which may limit the perceived novelty.

Experimental Evaluation

The paper provides a thorough overview of the system's capabilities through various demo scenarios, showcasing practical applications of the proposed architecture. However, it lacks quantitative evaluation metrics or comparative analysis against existing systems, which would strengthen the claims of enhanced interaction efficiency and task reliability. The absence of rigorous experimental validation is a notable gap.

Reproducibility

The paper does not provide specific implementation details or code availability, which raises concerns about reproducibility. While it mentions the intention to release code and materials as open source, the current lack of access limits the ability for others to replicate the findings.

Limitations

The paper does not address potential limitations of the system, such as the challenges of real-time processing on mobile devices, privacy concerns regarding data collection, and the dependency on the accuracy of the multimodal inputs. Additionally, the reliance on local device capabilities may restrict the system's performance in resource-constrained environments.

Broader Impact

The development of X-OmniClaw has the potential to significantly enhance user interaction with mobile devices, making them more intuitive and responsive to user needs. Its applications could extend beyond personal assistants to areas such as accessibility technology, education, and smart home integration, thereby impacting a wide range of fields. The main contribution of this paper is the introduction of X-OmniClaw, a unified mobile agent architecture that enhances multimodal understanding and interaction on Android devices. This work represents a significant step towards creating more intelligent and context-aware personal assistants, although it would benefit from more rigorous experimental validation and reproducibility measures.

Analysis: Full Paper • Full text: 31,827 characters

Do Melody and Rhythm Coevolve?

Harin Lee, Rainer Polak, Manuel Anglada-Tort ... · Proceedings of the Annual Meeting of the Cognitive Science Society

Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-i...

Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities between countries aligned with geographic and linguistic relationships, validating our approach. Substantial variation emerged in both melodic and rhythmic structures across countries, yet the diversity of the two components was not significantly correlated, challenging assumptions of coupled evolution. Only rhythmic diversity was significantly associated with ethnic and linguistic heterogeneity, while melodic diversity showed no such association. These findings suggest that melody and rhythm constitute partially independent systems shaped by distinct cultural and evolutionary pressures, rather than components of a single monolithic musical style.

Institutional Affiliations

Primary: University of Cambridge

All Institutions: University of Cambridge, RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion, University of Oslo, Department of Psychology, Goldsmiths College, University of London, Department of Life Sciences, Leipzig University, Division of Social Science, New York University Abu Dhabi, Department of Psychology, Cornell University

GitHub

ML Relevance Analysis (83)

This paper presents a significant advancement in understanding the independent evolution of melody and rhythm across cultures through a novel computational approach. The methodology is innovative and the findings challenge existing assumptions in music theory, providing a fresh perspective on cultural music analysis.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel computational pipeline that leverages deep learning source separation techniques to extract melodic and rhythmic features from a large dataset of songs. This approach is innovative as it allows for the analysis of music without relying on traditional, often biased, manual annotations. The methodology is well-detailed, including the use of kernel density estimation for summarizing melodic and rhythmic distributions, and the careful consideration of time scales for analyzing pitch intervals. The choice of using distributional profiles rather than higher-level constructs is a significant strength, as it minimizes analytical biases. The operational definitions of melody and rhythm are clear, although they are somewhat limited in scope.

Experimental Evaluation

The experimental design is robust, utilizing a large dataset of 27,628 songs from 59 countries, which provides a comprehensive basis for cross-cultural analysis. The authors validate their computational pipeline by demonstrating that the extracted distributions align with known musical patterns, thus establishing face validity. The use of Jensen-Shannon divergence to assess musical similarity between countries is appropriate and effectively highlights the independence of melodic and rhythmic diversity. However, the paper could benefit from additional metrics or qualitative assessments to further substantiate the findings.

Reproducibility

The paper provides sufficient detail regarding the methods and algorithms used, including the specific tools and parameters for source separation and feature extraction. The availability of the code and metadata through the provided GitHub link enhances reproducibility. However, the reliance on proprietary audio data from YouTube may limit the ability of others to fully replicate the study, particularly in regions with less representation.

Limitations

The authors acknowledge several limitations, including the potential biases introduced by using YouTube chart data, which may not capture traditional or non-commercial music. Additionally, the source separation algorithms are primarily trained on Western music, which could affect the accuracy of the extracted features for non-Western genres. The operational definitions of melody and rhythm are also somewhat narrow, potentially overlooking the complexity of musical interactions.

Broader Impact

The findings have significant implications for the fields of music cognition and cultural evolution, suggesting that melody and rhythm are shaped by different cultural and evolutionary pressures. This research could influence how music is studied across disciplines, including anthropology, psychology, and musicology. The methodology could also be applied to other forms of cultural expression, providing insights into the interplay between different artistic components. This paper presents a significant advancement in understanding the independent evolution of melody and rhythm across cultures through a novel computational approach. The methodology is innovative and the findings challenge existing assumptions in music theory, providing a fresh perspective on cultural music analysis.

Analysis: Full Paper • Full text: 27,431 characters

Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

Yan Zhuang, Minhao Liu, Yanru Zhang ... · arXiv

Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combina...

Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which approaches MER from the perspective of representation consistency, aiming to enable robust emotion prediction across heterogeneous modality combinations. MCUR incorporates two core components: (1) Modality Combination-Based and Category-Based Contrastive Learning mechanism (MCB-CL), which encourages samples with the same emotion category and the same available modalities to be close in the representation space; and (2) Sample-wise Uncertainty-Guided Regularization (SUGR), which adaptively assigns sample-wise uncertain weights to samples to optimize training. Extensive experiments demonstrate that MCUR consistently outperforms existing methods, achieving average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.

Institutional Affiliations

Primary: University of Electronic Science and Technology of China

All Institutions: University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China

ML Relevance Analysis (83)

The paper introduces the MCUR framework, which enhances multimodal emotion recognition by promoting representation consistency and addressing uncertainty in predictions. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and human-computer interaction.

Comprehensive Analysis

Methodology Assessment

The proposed MCUR framework presents a novel approach to multimodal emotion recognition (MER) by focusing on representation consistency across heterogeneous modalities. The integration of Modality Combination-Based and Category-Based Contrastive Learning (MCB-CL) and Sample-wise Uncertainty-Guided Regularization (SUGR) is a significant methodological advancement. MCB-CL enhances the discriminative power of representations by enforcing proximity in the embedding space for samples with the same emotion category and modality combination, while SUGR addresses uncertainty in predictions, allowing for adaptive weighting during training. This dual approach is innovative and effectively tackles the challenges posed by modality heterogeneity.

Experimental Evaluation

The experiments are comprehensive, utilizing three widely recognized datasets (MOSI, MOSEI, and IEMOCAP) to validate the effectiveness of the MCUR framework. The reported performance improvements over existing state-of-the-art methods, with average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP, demonstrate the robustness of the proposed approach. The ablation studies further substantiate the contributions of each component of the framework, revealing the critical role of both MCB-CL and SUGR in enhancing model performance.

Reproducibility

The paper provides detailed implementation details, including the training configurations, hyperparameter settings, and evaluation protocols, which are crucial for reproducibility. The authors also mention the use of official implementations for baseline models, ensuring a fair comparison. However, the lack of publicly available code or demo URLs limits the ease of reproduction for external researchers.

Limitations

While the MCUR framework shows promising results, the paper does not address the potential computational overhead associated with the added complexity of the proposed methods. Additionally, the performance in real-world noisy conditions is not thoroughly evaluated, which could limit the applicability of the framework in practical scenarios. The reliance on specific datasets may also restrict generalizability to other contexts or domains.

Broader Impact

The advancements presented in this paper have significant implications for human-computer interaction, particularly in enhancing emotion recognition systems that can adapt to varying modalities. The ability to maintain consistent representations across different modalities can improve the robustness of applications in areas such as virtual assistants, mental health monitoring, and social robotics. The focus on uncertainty in predictions may also lead to more reliable systems that can better handle real-world variability. The paper introduces the MCUR framework, which enhances multimodal emotion recognition by promoting representation consistency and addressing uncertainty in predictions. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and human-computer interaction.

Analysis: Full Paper • Full text: 50,026 characters

NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction

Weilong Huang, Le Nhat Tam Huynh, Oliver Thiergart ... · arXiv

Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural direc...

Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural directional filtering and diffuse sound extraction. NDF+ reformulates VDM estimation into two coupled subtasks: dereverberated VDM reconstruction and diffuse sound extraction. This reformulation enables NDF+ to manipulate diffuse components in the final reconstructed VDM output. We evaluated NDF+ under reverberant conditions and compared it with representative conventional baselines. Results show that NDF+ consistently outperforms the baselines on both subtasks, while maintaining VDM reconstruction quality comparable to that of the original single-task NDF model. These findings indicate that NDF+ introduces an additional degree of freedom for diffuse sound control in the VDM reconstruction. In a stereo recording application, NDF+ provides controllable inter-channel level differences between left and right channels by adjusting the estimated diffuse component.

Institutional Affiliations

Primary: International Audio Laboratories Erlangen

All Institutions: International Audio Laboratories Erlangen, Fraunhofer IIS, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of NDF+, a joint framework for neural directional filtering and diffuse sound extraction that enhances VDM reconstruction while allowing for effective control of diffuse sound components. This work represents a significant step forward in spatial audio processing, combining innovative methodologies with rigorous experimental validation to address key challenges in the field.

Comprehensive Analysis

Methodology Assessment

The paper introduces NDF+, a novel framework that combines neural directional filtering with diffuse sound extraction, effectively reformulating the VDM estimation into two coupled subtasks. The methodology employs a dual-mask architecture using LSTM networks to estimate coherent and diffuse components, which is a significant advancement over previous models that focused solely on VDM reconstruction. The approach is well-structured, with a clear explanation of the DNN architecture, training strategy, and loss functions, demonstrating a thoughtful integration of existing techniques with innovative modifications.

Experimental Evaluation

The experimental evaluation is comprehensive, comparing NDF+ against conventional baselines under various reverberant conditions. The results indicate that NDF+ consistently outperforms these baselines on both subtasks while maintaining VDM reconstruction quality. The use of objective metrics such as SDR and PESQ to measure performance adds rigor to the evaluation. However, the paper could benefit from more detailed qualitative assessments, such as user studies or subjective listening tests, to further validate the improvements in audio quality.

Reproducibility

The paper provides a detailed description of the experimental setup, including the configurations of the microphone array, training data, and evaluation metrics. However, the absence of a public code repository or demo URL limits the reproducibility of the results. Including such resources would enhance the paper's impact and allow other researchers to validate and build upon the findings.

Limitations

One limitation is the reliance on simulated environments for training and testing, which may not fully capture the complexities of real-world acoustic scenarios. Additionally, while the paper discusses the performance of NDF+ in stereo recording applications, it does not explore its scalability to larger microphone arrays or more complex sound environments.

Broader Impact

The advancements presented in NDF+ have significant implications for spatial audio applications, particularly in enhancing the quality of recordings in reverberant environments. The ability to control diffuse sound components can improve immersive audio experiences in various fields, including virtual reality, telecommunications, and music production. The framework could also inspire further research into joint signal processing techniques in audio applications. The main contribution of this paper is the introduction of NDF+, a joint framework for neural directional filtering and diffuse sound extraction that enhances VDM reconstruction while allowing for effective control of diffuse sound components. This work represents a significant step forward in spatial audio processing, combining innovative methodologies with rigorous experimental validation to address key challenges in the field.

Analysis: Full Paper • Full text: 18,892 characters

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

Wonwoo Jeong · arXiv

In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 co...

In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $ε= 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.

Institutional Affiliations

Primary: Sogang University

All Institutions: Sogang University

GitHub

ML Relevance Analysis (83)

The paper presents a significant advancement in audio evaluation metrics by introducing OTAD, which effectively addresses the limitations of existing methods through innovative methodological contributions and rigorous empirical validation.

Comprehensive Analysis

Methodology Assessment

The proposed methodology introduces a novel Optimal Transport Audio Distance (OTAD) metric that addresses the limitations of existing metrics like Fréchet Audio Distance (FAD) by employing a dual correction mechanism: a learned Riemannian ground-metric adapter for the cost function and entropic Sinkhorn optimal transport for the coupling. This innovative approach allows for a more sensitive detection of artifacts in audio generation, which is crucial for applications in text-to-audio synthesis. The method is theoretically grounded and systematically validated through a comprehensive experimental design, including a factorial decomposition of the contributions from cost and coupling.

Experimental Evaluation

The experiments are robust, utilizing eight different encoders and a four-axis evaluation protocol to assess the performance of OTAD against FAD and KAD. The results indicate a significant improvement in sensitivity to rank-1 contamination and a higher correlation with human Mean Opinion Scores (MOS). The experiments also include per-sample diagnostics, which provide insights into the specific artifacts present in audio samples, highlighting the practical utility of OTAD in real-world applications.

Reproducibility

The paper includes sufficient detail regarding the implementation of the OTAD metric and the experimental setup, including the datasets used (FSD50K and ESC-50) and the training of the adapters. The release of the OTAD toolkit on GitHub further enhances reproducibility, allowing other researchers to replicate the findings and utilize the metric in their own work.

Limitations

The study acknowledges several limitations, including the reliance on a single listening test for MOS validation and the potential biases introduced by training on a specific dataset (FSD50K). Additionally, the performance of OTAD on music and speech domains remains untested, and the scalability of the method for larger datasets is not fully explored.

Broader Impact

The introduction of OTAD has significant implications for the field of audio generation evaluation, providing a more nuanced and sensitive metric that can improve the quality of generated audio. This advancement could lead to better user experiences in applications such as music synthesis, sound design, and audio restoration. The methodology can also serve as a blueprint for future research in audio evaluation metrics across different domains. The paper presents a significant advancement in audio evaluation metrics by introducing OTAD, which effectively addresses the limitations of existing methods through innovative methodological contributions and rigorous empirical validation.

Analysis: Full Paper • Full text: 30,942 characters

PianoCoRe: Combined and Refined Piano MIDI Dataset

Ilya Borovik · Transactions of the International Society for Music Information Retrieval, 9(1), 144-163, 2026 · Transactions of the International Society for Music Information Retrieval

Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. ...

Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.

Institutional Affiliations

Primary: Skolkovo Institute of Science and Technology

All Institutions: Skolkovo Institute of Science and Technology

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of the PianoCoRe dataset, a refined and comprehensive MIDI dataset that addresses the limitations of existing resources in symbolic music analysis. This work significantly enhances the foundation for future research in expressive piano performance modeling and music information retrieval, showcasing a meticulous approach to dataset curation and quality assessment.

Comprehensive Analysis

Methodology Assessment

The methodology presented in this paper is robust and comprehensive, detailing a systematic approach to curating and refining a large-scale piano MIDI dataset. The authors employ a multi-tiered strategy that includes deduplication, quality assessment, and note alignment refinement using the RAScoP pipeline. The integration of various existing datasets into a unified collection is particularly noteworthy, as it addresses the inconsistencies and limitations found in previous datasets. The use of a MIDI quality classifier to filter out corrupted transcriptions and the detailed description of the alignment process further enhance the methodological rigor.

Experimental Evaluation

The experiments conducted demonstrate the effectiveness of the proposed dataset and methodologies. The authors provide a thorough evaluation of the MIDI quality classifier, achieving a high macro F1 score, which indicates the classifier's reliability in distinguishing between performance qualities. Additionally, the application of the dataset in training an expressive performance rendering model shows significant improvements in robustness, suggesting that the dataset effectively supports advanced modeling tasks. However, specific quantitative results from the expressive performance rendering model could further strengthen the experimental validation.

Reproducibility

The paper includes detailed descriptions of the dataset construction process, including data sources, matching methodologies, and quality assessment techniques. The authors provide a GitHub repository link for the project, which enhances reproducibility. However, the paper could benefit from including specific implementation details or code snippets to facilitate replication of the methodologies by other researchers.

Limitations

One limitation identified is the reliance on existing datasets, which may still contain inherent biases or limitations that could affect the quality of the combined dataset. Additionally, while the RAScoP pipeline improves alignment, the paper does not fully address potential edge cases where alignment might still be problematic. The focus on public domain works may also limit the dataset's applicability to contemporary compositions.

Broader Impact

The PianoCoRe dataset has the potential to significantly impact the field of music information retrieval and computational musicology by providing a comprehensive resource for training models in expressive performance rendering and analysis. Its tiered structure allows for diverse applications, from large-scale analysis to specific performance modeling tasks, thus fostering advancements in music generation and understanding. The main contribution of this paper is the introduction of the PianoCoRe dataset, a refined and comprehensive MIDI dataset that addresses the limitations of existing resources in symbolic music analysis. This work significantly enhances the foundation for future research in expressive piano performance modeling and music information retrieval, showcasing a meticulous approach to dataset curation and quality assessment.

Analysis: Full Paper • Full text: 50,026 characters

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

Julius Richter, Yoshiki Masuyama, Christoph Boeddeker ... · arXiv

We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to br...

We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains up +1.0 NISQA for speech separation.

Institutional Affiliations

Primary: MERL

All Institutions: MERL

GitHub

ML Relevance Analysis (83)

The paper presents a novel framework that bridges predictive and generative modeling for speech enhancement and separation, demonstrating significant improvements in perceptual quality while maintaining competitive performance on traditional metrics. The comprehensive methodology and robust experimental validation position this work as a meaningful contribution to the field of machine learning in audio processing.

Comprehensive Analysis

Methodology Assessment

The proposed Stochastic Interpolant Prior for Speech (SIPS) framework effectively integrates predictive and generative modeling approaches, addressing the limitations of both paradigms by introducing a mathematically grounded decomposition of interpolation dynamics. This innovative methodology allows for a flexible and efficient plug-and-play integration with existing predictors, enhancing the perceptual quality of speech enhancement and separation tasks while maintaining fidelity to the original signals.

Experimental Evaluation

The experiments conducted demonstrate the efficacy of SIPS across various tasks, including speech enhancement and separation, using multiple state-of-the-art predictors. The results indicate consistent improvements in non-intrusive perceptual quality metrics, alongside competitive performance in reference-based metrics, showcasing the robustness and versatility of the proposed method.

Reproducibility

The paper provides a clear implementation of the proposed method, including detailed descriptions of the experimental setup, data representation, and training procedures. The availability of the implementation on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings.

Limitations

One limitation is the reliance on clean speech data for training the generative prior, which may affect performance in real-world scenarios with diverse degradation types. Additionally, while the method shows promise, further exploration of its generalization capabilities across different audio domains is warranted.

Broader Impact

The SIPS framework has significant implications for various applications in speech processing, including telecommunications, assistive technologies, and audio content creation. By improving speech quality in challenging conditions, this work can enhance user experiences in voice communication systems and contribute to advancements in automatic speech recognition and natural language processing. The paper presents a novel framework that bridges predictive and generative modeling for speech enhancement and separation, demonstrating significant improvements in perceptual quality while maintaining competitive performance on traditional metrics. The comprehensive methodology and robust experimental validation position this work as a meaningful contribution to the field of machine learning in audio processing.

Analysis: Full Paper • Full text: 35,870 characters

Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models

Amir Ivry · arXiv

Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for...

Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for a deployment-critical query family. We study answer-preserving audio compression, judging a compressor by the excess answer-error it induces, especially for the worst-affected family. We formulate this theoretically as a compressor acceptance-rejection criterion, derive a practical sign-off protocol that returns compression budgets satisfying worst-family checks with statistical confidence, and evaluate it on five multiple-choice audio question-answering benchmarks with two Qwen-based backbones. The protocol exposes hidden family-level damage, shows that the chosen query-family partition can change the approved budget, and identifies regimes where query-conditioned compression helps maintain answer preservation.

Institutional Affiliations

Primary: Technion--Israel Institute of Technology

All Institutions: Technion--Israel Institute of Technology

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of a framework for task-aware answer-preserving audio compression, which addresses the critical challenge of maintaining answer quality in large audio language models under compression constraints. This work significantly advances the understanding of audio compression impacts on model performance and provides a practical methodology for evaluating and ensuring answer preservation across diverse query families.

Comprehensive Analysis

Methodology Assessment

The methodology presented in this paper is robust and well-structured. The authors introduce a theoretical framework for task-aware answer-preserving audio compression, which is a novel approach to evaluating audio compression techniques in the context of large audio language models (LALMs). The paper formulates a compressor acceptance-rejection criterion and derives a practical sign-off protocol that incorporates statistical confidence, which is a significant contribution to the field. The approach is grounded in a solid theoretical foundation, linking practical deployment configurations to answer preservation metrics. The use of paired evaluations and the focus on worst-family checks are particularly noteworthy, as they address the critical issue of performance degradation across different query families.

Experimental Evaluation

The experimental evaluation is comprehensive, utilizing five multiple-choice audio question-answering benchmarks. The authors effectively demonstrate the applicability of their framework and the importance of considering family-level performance rather than relying solely on average metrics. The results reveal significant insights into how compression can affect different query families, showcasing the hidden damage that can occur when using average performance metrics. The experiments are well-designed, and the analysis is thorough, providing empirical support for the theoretical claims made in the paper.

Reproducibility

The paper provides detailed descriptions of the experimental setup, including the datasets, models, and evaluation metrics used. However, there are some limitations regarding the availability of code and data, as no URLs for project repositories or demo pages are provided. This lack of resources may hinder reproducibility for other researchers looking to validate or build upon the findings.

Limitations

The paper acknowledges several limitations, including the potential for query-family coarsening and the challenges of estimating true Bayes risks due to calibration errors and prompt sensitivity. Additionally, the framework's applicability to different languages, longer audio clips, or varying deployment scenarios is not fully established, which may limit its generalizability.

Broader Impact

The proposed framework has significant implications for the deployment of audio language models in real-world applications, particularly in scenarios where audio compression is necessary for efficiency. By emphasizing the importance of answer preservation across different query families, this work could influence future research and development in audio processing, machine learning, and multimodal systems. The findings could lead to improved audio compression techniques that better maintain the integrity of information critical for specific tasks. The main contribution of this paper is the introduction of a framework for task-aware answer-preserving audio compression, which addresses the critical challenge of maintaining answer quality in large audio language models under compression constraints. This work significantly advances the understanding of audio compression impacts on model performance and provides a practical methodology for evaluating and ensuring answer preservation across diverse query families.

Analysis: Full Paper • Full text: 50,026 characters

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

Guanrou Yang, Tian Tan, Qian Chen ... · arXiv

Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned fro...

Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.

Institutional Affiliations

Primary: Shanghai Jiao Tong University

All Institutions: Shanghai Jiao Tong University, Shanghai Innovation Institute, Tencent, Independent Researcher, Peking University, Tianjin University, Zhejiang University

GitHub

ML Relevance Analysis (83)

WavCube presents a novel approach to unify speech understanding and generation through a compact continuous latent representation. This paper makes a substantial contribution to the field by addressing the compatibility challenges between semantic and acoustic features, demonstrating its effectiveness through rigorous experimentation across multiple benchmarks.

Comprehensive Analysis

Methodology Assessment

The methodology employed in WavCube is innovative, utilizing a two-stage training scheme that effectively addresses the challenges of integrating semantic and acoustic representations. The first stage compresses high-dimensional SSL features into a compact latent space, while the second stage enriches this latent space with fine-grained acoustic details. This approach is well-justified and systematically tackles the inherent flaws of existing SSL representations, making it a significant contribution to the field.

Experimental Evaluation

The experiments conducted are comprehensive and well-structured, demonstrating WavCube's performance across various tasks, including speech understanding, reconstruction, and generation. The results show that WavCube achieves competitive performance against existing methods, indicating its effectiveness and robustness. The use of benchmarks like SUPERB and the detailed evaluation metrics further enhance the credibility of the findings.

Reproducibility

The paper provides sufficient details regarding the methodology and experimental setup, including the datasets and training configurations used. However, the lack of a demo or interactive component may hinder some aspects of reproducibility for practitioners who wish to implement the model.

Limitations

While the paper presents a strong framework, it does not explicitly discuss potential limitations or assumptions underlying the proposed approach. For instance, the performance drop due to dimensionality reduction and the reliance on specific datasets could be areas of concern that warrant further exploration.

Broader Impact

The implications of WavCube are significant, as it offers a unified framework for speech processing that could enhance applications in voice synthesis, speech recognition, and multimodal interactions. By bridging the gap between understanding and generation, WavCube could pave the way for more integrated and efficient speech technologies. WavCube presents a novel approach to unify speech understanding and generation through a compact continuous latent representation. This paper makes a substantial contribution to the field by addressing the compatibility challenges between semantic and acoustic features, demonstrating its effectiveness through rigorous experimentation across multiple benchmarks.

Analysis: Full Paper • Full text: 50,026 characters

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Rixi Xu, Qingyu Liu, Haitao Li ... · arXiv

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified represe...

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.

Institutional Affiliations

Primary: Zhejiang University

All Institutions: Zhejiang University, Beijing Haitian Ruisheng Science Technology Ltd, Center for Language and Speech Processing, Fudan University, Geely Automobile Research Institute (Ningbo) Company Ltd, MoE Key Lab of Artificial Intelligence, Shanghai Innovation Institute, Shanghai Jiao Tong University, X-LANCE Lab

GitHub

ML Relevance Analysis (83)

The paper presents X-Voice, a novel multilingual zero-shot voice cloning model that significantly advances the capabilities of TTS systems across 30 languages. The methodology, which includes a two-stage training process and innovative architectural enhancements, addresses critical limitations in existing systems, making it a valuable contribution to the field of machine learning and audio processing.

Comprehensive Analysis

Methodology Assessment

The paper introduces a two-stage training paradigm for zero-shot voice cloning, which is a significant advancement in the field. The first stage focuses on building a robust multilingual backbone using a large corpus, while the second stage fine-tunes the model using synthetic audio prompts without the need for reference transcripts. This approach effectively addresses the challenges of multilingual TTS systems, particularly the reliance on aligned text and audio, which is often problematic for low-resource languages. The introduction of dual-level language injection and decoupled classifier-free guidance further enhances the model's ability to maintain speaker identity and prosodic accuracy across languages.

Experimental Evaluation

The experimental results are comprehensive, comparing X-Voice against several state-of-the-art models across multiple languages. The use of both subjective and objective evaluation metrics, including WER, SIM-o, IMOS, and SMOS, provides a well-rounded assessment of the model's performance. The results indicate that X-Voice achieves competitive performance, particularly in low-resource languages, while also demonstrating improvements in intelligibility and speaker consistency compared to existing systems. The release of a new evaluation benchmark with human annotations adds significant value to the research community.

Reproducibility

The paper provides detailed implementation details, including model configurations, training setups, and evaluation protocols, which enhances reproducibility. The authors have also open-sourced their training corpus and evaluation benchmarks, fostering transparency and allowing other researchers to build upon their work.

Limitations

Despite its strengths, the model still faces challenges in preserving speaker similarity in certain phonological contexts, indicating a trade-off between accent suppression and timbre preservation. Additionally, the handling of intra-sentential code-switching is noted as an area for future improvement. The reliance on high-quality synthetic data in the fine-tuning stage may also limit the model's applicability in scenarios where such data is not available.

Broader Impact

The advancements presented in this paper have the potential to democratize high-fidelity TTS technology, making it accessible for a wider range of languages, including low-resource ones. The implications extend to various applications, such as personalized voice assistants, language learning tools, and accessibility technologies for individuals with speech impairments. The open-sourcing of resources could significantly accelerate research in multilingual TTS systems and contribute to the development of more inclusive technologies. The paper presents X-Voice, a novel multilingual zero-shot voice cloning model that significantly advances the capabilities of TTS systems across 30 languages. The methodology, which includes a two-stage training process and innovative architectural enhancements, addresses critical limitations in existing systems, making it a valuable contribution to the field of machine learning and audio processing.

Analysis: Full Paper • Full text: 45,343 characters

Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

Lisan Al Amin, Rakib Hossain, Mahbubul Islam ... · arXiv

Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio...

Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.

Institutional Affiliations

Primary: Potomac Quantum

All Institutions: Potomac Quantum, United International University, University of Maryland, Monash University, University of the Sunshine Coast

ML Relevance Analysis (77)

The paper presents Q-Patch, a quantum feature-mapping framework for audio spoofing detection that effectively utilizes time-frequency structures in spectrograms. This innovative approach, combined with rigorous experimental validation, positions the work as a meaningful contribution to the field of audio deepfake detection and quantum machine learning.

Comprehensive Analysis

Methodology Assessment

The methodology introduces Q-Patch, a novel quantum feature mapping framework specifically designed for audio spoofing detection. It effectively utilizes local time-frequency patches from mel-spectrograms, which is a significant improvement over treating spectrograms as generic images. The use of shallow, hardware-efficient quantum circuits with adjacency-aware entanglement is innovative, as it addresses practical constraints of near-term quantum computing. The approach to summarize patches into compact four-dimensional descriptors before quantum embedding is well thought out, allowing for efficient processing while maintaining relevant information.

Experimental Evaluation

The experimental evaluation is conducted on a balanced dataset derived from LJ Speech, which includes both bona fide and spoofed audio samples. The results indicate that Q-Patch outperforms classical baselines, achieving an AUROC of 0.87 compared to 0.82 for RBF-SVM. The analysis of kernel-space structure further supports the effectiveness of the proposed method, showing clear class separability. However, the limited dataset size (100 samples) raises concerns about the generalizability of the results, which should be addressed in future work.

Reproducibility

The paper provides a detailed description of the methodology, including data preparation, feature extraction, and quantum embedding processes. However, the absence of code or a project URL limits the reproducibility of the results. Future work should include sharing the implementation details or code to enable other researchers to replicate the findings.

Limitations

The study's limitations include the small dataset size, which may not capture the full diversity of real-world audio spoofing attacks. The controlled nature of the spoof generation (using additive noise and spectral distortions) may not reflect the complexities of actual spoofing methods. Additionally, the results are based on ideal quantum simulations, which may not translate directly to performance on physical quantum hardware.

Broader Impact

The proposed Q-Patch framework has the potential to significantly impact the field of audio deepfake detection by introducing quantum machine learning techniques that leverage time-frequency structures. This could lead to more robust detection methods that are particularly useful in low-resource settings. As quantum computing technology advances, the framework may become increasingly applicable in real-world scenarios, enhancing security against audio spoofing. The paper presents Q-Patch, a quantum feature-mapping framework for audio spoofing detection that effectively utilizes time-frequency structures in spectrograms. This innovative approach, combined with rigorous experimental validation, positions the work as a meaningful contribution to the field of audio deepfake detection and quantum machine learning.

Analysis: Full Paper • Full text: 30,031 characters

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Cyril Allauzen, Tom Bagby, Georg Heigold ... · arXiv

The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a sing...

The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs - including members from the Gemini and GPT families - across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audionative and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions regarding latency, cost, and reasoning depth.

Institutional Affiliations

Primary: Google

All Institutions: Google USA & Germany

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the rigorous empirical evaluation of leading audio-native LLMs on the MSEB, providing valuable insights into their performance and the challenges of achieving audio-text parity. The comprehensive analysis of methodologies and results positions this work as a significant step forward in the integration of audio processing within the framework of large language models, addressing both theoretical and practical aspects of the field.

Comprehensive Analysis

Methodology Assessment

The paper presents a comprehensive methodology for applying large language models (LLMs) to the Massive Sound Embedding Benchmark (MSEB), detailing a systematic approach to task-specific prompting and evaluation across diverse audio tasks. The methodology is well-structured, with clear definitions of tasks, input/output formats, and considerations for model performance. The iterative refinement of prompt templates through interactions with models like Gemini 3 demonstrates a thoughtful approach to optimizing LLMs for audio tasks. However, the paper could benefit from a more detailed discussion of the limitations of the chosen methodologies, particularly regarding the adaptability of LLMs to non-generative tasks.

Experimental Evaluation

The experimental evaluation is robust, covering a wide range of models and tasks, with detailed performance metrics provided for each evaluation. The use of a diverse set of datasets, including multilingual and varied acoustic environments, enhances the reliability of the results. The paper effectively compares audio-native LLMs with traditional cascaded systems, providing insights into their relative strengths and weaknesses. However, the analysis of results could be improved by including more visual aids (e.g., graphs) to illustrate performance trends across tasks and models.

Reproducibility

The paper mentions the open-source nature of the MSEB toolkit and provides a link to the GitHub repository, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings, training protocols, and the exact versions of models used, which could hinder full reproducibility for other researchers.

Limitations

The paper acknowledges the significant modality gap that persists between audio and text processing, which is a critical limitation. Additionally, the authors note the challenges in achieving consistent performance across different locales and acoustic conditions, indicating that the models may not generalize well in real-world applications. The potential for test data contamination is also a significant concern that could skew results.

Broader Impact

The findings of this research have significant implications for the development of audio processing systems, particularly in enhancing the capabilities of LLMs in understanding and reasoning about audio data. The establishment of the MSEB as a benchmark could drive further research and innovation in the field, promoting the development of more robust and versatile audio-native models. The open-source nature of the toolkit encourages community engagement and collaboration, which could accelerate advancements in auditory intelligence. The main contribution of this paper is the rigorous empirical evaluation of leading audio-native LLMs on the MSEB, providing valuable insights into their performance and the challenges of achieving audio-text parity. The comprehensive analysis of methodologies and results positions this work as a significant step forward in the integration of audio processing within the framework of large language models, addressing both theoretical and practical aspects of the field.

Analysis: Full Paper • Full text: 32,356 characters

Hearing the Ocean: Bio-inspired Gammatone-CNN framework for Robust Underwater Acoustic Target Classification

Rajeshwar Tripathi, Sandeep Kumar, Monika Aggarwal ... · arXiv

This study presents a bio inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). The latest state of the art methods often fail to resolve dense low frequency harmonic structures in vessel propulsion signals under high noise conditions, whi...

This study presents a bio inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). The latest state of the art methods often fail to resolve dense low frequency harmonic structures in vessel propulsion signals under high noise conditions, which is addressed by the proposed framework using a biologically inspired Gammatone filter bank that emulates the cochlea nonlinear frequency selectivity. By distributing filters according to the Equivalent Rectangular Bandwidth (ERB) scale, the framework achieves a high fidelity representation of engine radiated tonals while effectively suppressing isotropic ambient interference. The resulting Cochleagram features are processed by a lightweight, custom designed Convolutional Neural Network (CNN) that leverages large receptive fields to integrate spectral-temporal continuities. Experimental results on the VTUAD dataset demonstrate a state of the art classification accuracy of 98.41%, outperforming Continuous Wavelet Transform and Mel Frequency Cepstral Coefficients baselines by 3.5% and 7.7% respectively. Furthermore, the framework achieves an inference latency of only 0.77 ms and a 0.971 Cohen Kappa score, validating its efficacy for real time deployment on autonomous, low-power sonar hardware.

Institutional Affiliations

Primary: Centre for Applied Research in Electronics (CARE)

All Institutions: Central Research Laboratory, Bharat Electronics Limited, Ghaziabad, India, Centre for Applied Research in Electronics (CARE), IIT Delhi, India

ML Relevance Analysis (83)

The main contribution of this paper is the development of a bio-inspired Gammatone-CNN framework for underwater acoustic target classification, achieving state-of-the-art performance through innovative feature extraction techniques. This research significantly advances the field of underwater acoustics by providing a method that combines biological principles with modern machine learning, demonstrating the potential for improved classification accuracy in challenging acoustic environments.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel bio-inspired Gammatone filter bank for underwater acoustic target classification, leveraging the non-linear frequency selectivity of the cochlea. The methodology prioritizes feature extraction over architectural complexity, employing a lightweight CNN that effectively integrates spectral-temporal features. The use of the Equivalent Rectangular Bandwidth (ERB) scale for filter distribution is particularly innovative, allowing for high fidelity in low-frequency representation, which is crucial for underwater acoustics. The mathematical foundations of the Gammatone filter and the detailed description of the Cochleagram formation process provide a solid basis for the proposed approach.

Experimental Evaluation

The experimental validation is robust, utilizing the VTUAD dataset to demonstrate the framework's effectiveness. Achieving a classification accuracy of 98.41% and a Cohen Kappa score of 0.971 indicates strong performance and reliability. The comparative analysis against established methods like Continuous Wavelet Transform and Mel Frequency Cepstral Coefficients shows significant improvements, reinforcing the proposed method's superiority. The inclusion of diverse metrics such as ROC curves and confusion matrices adds depth to the evaluation.

Reproducibility

The paper provides detailed information on the experimental setup, including dataset partitioning, feature extraction parameters, and model architecture. However, the absence of a publicly available code repository limits reproducibility. Future work should consider sharing implementation details to facilitate validation by other researchers.

Limitations

While the proposed framework shows remarkable performance, it may struggle with class imbalance, particularly for underrepresented classes like Passengership. The reliance on a specific dataset (VTUAD) may also limit generalizability to other underwater environments. Additionally, the computational efficiency on standard CPUs, while acceptable, could be a concern for real-time applications in more constrained environments.

Broader Impact

The implications of this research extend to maritime security, ecological monitoring, and autonomous underwater vehicles (AUVs). By improving underwater target recognition, the framework can enhance surveillance capabilities and contribute to the protection of marine ecosystems. The low-power, real-time processing capabilities make it suitable for deployment in resource-constrained environments. The main contribution of this paper is the development of a bio-inspired Gammatone-CNN framework for underwater acoustic target classification, achieving state-of-the-art performance through innovative feature extraction techniques. This research significantly advances the field of underwater acoustics by providing a method that combines biological principles with modern machine learning, demonstrating the potential for improved classification accuracy in challenging acoustic environments.

Analysis: Full Paper • Full text: 30,521 characters

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

Leying Zhang, Bowen Shi, Haibin Wu ... · arXiv

The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional f...

The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction following data preparation pipeline, incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without the need for task-specific retraining.

Institutional Affiliations

Primary: Shanghai Jiao Tong University

All Institutions: Shanghai Jiao Tong University, MoE Key Laboratory of Artificial Intelligence, AI Institute

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of JASTIN, a novel instruction-driven framework for zero-shot audio evaluation that significantly enhances the evaluation process by integrating multimodal LLMs with advanced audio processing techniques. This work represents a meaningful advancement in the field of audio evaluation, addressing critical challenges and setting a new standard for future research.

Comprehensive Analysis

Methodology Assessment

The proposed JASTIN framework innovatively integrates a frozen high-performance audio encoder with a fine-tuned LLM backbone through a trainable audio adapter, addressing the limitations of existing evaluation metrics by employing a self-instructed reasoning paradigm. The comprehensive data preparation pipeline, which includes multi-source, multi-task, multi-calibration, and multi-description strategies, enhances the model's zero-shot generalization capabilities, making it adaptable to various audio evaluation tasks without the need for task-specific retraining.

Experimental Evaluation

The experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings across diverse audio domains, including speech, sound, and music. The framework consistently outperforms both traditional metrics and general MLLMs, showcasing its robustness and effectiveness in real-world applications. The evaluation on out-of-domain tasks further emphasizes its generalization capabilities, which is a significant advancement in the field.

Reproducibility

The authors have provided detailed implementation information, including training configurations, data preparation methods, and evaluation metrics, which enhances the reproducibility of their results. However, the lack of a demo URL limits immediate accessibility for other researchers to test the framework.

Limitations

While the paper presents a comprehensive framework, it does not address potential biases in the training data or the limitations of the LLMs used. Additionally, the model's performance on highly specialized audio tasks may still require further validation.

Broader Impact

The JASTIN framework has the potential to revolutionize audio evaluation methodologies by providing a more flexible and generalizable approach. Its implications extend to various applications in audio synthesis, music generation, and speech processing, enabling more efficient and scalable evaluation processes in these domains. The main contribution of this paper is the introduction of JASTIN, a novel instruction-driven framework for zero-shot audio evaluation that significantly enhances the evaluation process by integrating multimodal LLMs with advanced audio processing techniques. This work represents a meaningful advancement in the field of audio evaluation, addressing critical challenges and setting a new standard for future research.

Analysis: Full Paper • Full text: 50,026 characters

Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement

Dongheon Lee, Ashutosh Pandey, Sanjeel Parekh ... · arXiv

While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, ...

While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, a neural network designed to generate virtual microphone (VM) signals from a limited set of real microphone (RM) measurements. Moreover, we introduce the Spatial Audio Representation Learning (SARL) framework, which leverages estimated VM signals and features to condition a downstream speech enhancement system. Experimental results demonstrate that the proposed framework outperforms existing spatial upsampling baselines across various speech extraction systems, including end-to-end multichannel speech enhancement and neural beamforming. The proposed method nearly recovers the oracle performance achieved when all microphones are available.

Institutional Affiliations

Primary: Korea Advanced Institute of Science and Technology (KAIST)

All Institutions: Korea Advanced Institute of Science and Technology (KAIST), Meta Reality Labs Research

ML Relevance Analysis (83)

The paper introduces Spatial-Magnifier, a neural network for spatial upsampling in multichannel speech enhancement, and the SARL framework, significantly enhancing downstream speech processing tasks. The innovative approach and rigorous experimental validation position this work as a valuable contribution to the field of audio signal processing and machine learning.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel neural network architecture, Spatial-Magnifier, which effectively generates virtual microphone signals from real microphone measurements. It introduces the Spatial Audio Representation Learning (SARL) framework, which enhances the conditioning of downstream speech enhancement tasks by leveraging both estimated virtual microphone signals and features. The use of a GAN-based approach and the incorporation of selection and dynamic channel allocation modules are innovative aspects that contribute to the flexibility and efficiency of the model. The methodology is well-structured, with clear definitions and a logical flow from problem identification to proposed solutions.

Experimental Evaluation

The experiments are comprehensive, utilizing a well-defined dataset and a robust experimental setup to evaluate the performance of the proposed methods. The authors conduct ablation studies and comparisons with existing baselines, demonstrating the effectiveness of their approach across different configurations and tasks. The results indicate significant improvements in performance metrics such as SI-SDR, SNR, PESQ, and STOI, showcasing the technical superiority of the proposed methods over traditional approaches.

Reproducibility

The paper provides sufficient details regarding the experimental setup, including the architecture parameters, training procedures, and evaluation metrics. However, the absence of a publicly accessible code repository or demo URL limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate validation by the research community.

Limitations

One limitation is the reliance on simulated data for training and evaluation, which may not fully capture the complexities of real-world environments. Additionally, while the proposed methods show promise, the performance in highly dynamic or noisy environments remains to be thoroughly evaluated. The computational efficiency, while improved, could still be a concern for deployment on resource-constrained devices.

Broader Impact

The proposed methods have significant implications for real-world applications in speech enhancement, particularly in consumer electronics such as AR glasses and hearing aids. By enabling effective multichannel speech enhancement with fewer microphones, the work addresses a critical need for improved audio capture in compact devices. The advancements in spatial audio processing could also benefit various fields, including telecommunications, virtual reality, and assistive technologies. The paper introduces Spatial-Magnifier, a neural network for spatial upsampling in multichannel speech enhancement, and the SARL framework, significantly enhancing downstream speech processing tasks. The innovative approach and rigorous experimental validation position this work as a valuable contribution to the field of audio signal processing and machine learning.

Analysis: Full Paper • Full text: 21,431 characters

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Yangchen Yu, Qian Chen, Jia Li ... · arXiv

Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-...

Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resolution (DCR), a unified framework that learns when to fuse and when to drop modalities. Path I (Affective Fusion Distiller, AFD) performs reverse distillation from audio/visual teachers to a textual student using temporally weighted class evidence, thereby enhancing representation-level calibration and improving fusion when alignment is beneficial. Path II (Affective Discernment Agent, ADA) formulates MER as a contextual bandit that selects among fusion and unimodal predictions based on a dual-view state and a calibration-aware reward, enabling decision-level arbitration under irreconcilable conflicts without requiring per-modality reliability labels. By taking into account the full multimodal context and coupling soft calibration with hard arbitration, DCR reconciles conflicts that can be aligned while bypassing misleading modalities when fusion is harmful. Across five benchmarks covering both dialogue-level and clip-level MER, DCR consistently outperforms competitive baselines or achieves highly competitive results. Further ablations, conflict-specific subset evaluation, and modality-selection analysis verify that AFD and ADA are complementary and jointly improve robust conflict-aware emotion recognition.

Institutional Affiliations

Primary: Hefei University of Technology

All Institutions: Hefei University of Technology, Singapore Management University, Nanyang Technological University, MIT Media Lab

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of the Dual-Path Conflict Resolution framework, which innovatively addresses modality conflicts in multimodal emotion recognition by employing a dual-path approach that distinguishes between benign and severe conflicts. This comprehensive analysis highlights the technical contributions, methodological rigor, and potential impact of the research on the field of affective computing.

Comprehensive Analysis

Methodology Assessment

The proposed Dual-Path Conflict Resolution (DCR) framework is a significant advancement in multimodal emotion recognition (MER). It effectively distinguishes between benign and severe modality conflicts, employing two distinct paths (AFD and ADA) to handle these conflicts appropriately. AFD utilizes knowledge distillation to enhance textual representations with non-verbal cues, while ADA employs a contextual bandit approach for decision-level arbitration. This dual-path strategy is innovative as it shifts the focus from traditional fusion methods that may amplify errors to a more nuanced conflict-aware approach. The methodology is well-structured, with clear definitions of conflict types and a comprehensive explanation of how each path operates.

Experimental Evaluation

The experimental evaluation is robust, covering five diverse benchmarks that include both dialogue-level and clip-level datasets. The results consistently demonstrate that DCR outperforms competitive baselines, indicating its effectiveness across different contexts. The paper includes detailed ablation studies that validate the contributions of each component within the DCR framework, further strengthening the findings. The use of multiple evaluation metrics enhances the reliability of the results.

Reproducibility

The paper provides sufficient implementation details, including the architecture, training protocols, and datasets used. However, the absence of a public demo or detailed code repository at the time of review limits reproducibility. The authors mention that the source code and models will be released, which is a positive step towards enhancing reproducibility.

Limitations

One limitation of the study is the reliance on heuristic approximations for defining conflict severity, which may not capture the full complexity of modality interactions in real-world scenarios. Additionally, while the framework shows strong performance, its effectiveness in highly nuanced or ambiguous emotional contexts remains to be fully explored.

Broader Impact

The DCR framework has significant implications for various applications, including human-computer interaction, healthcare, and robotics, where accurate emotion recognition is crucial. By addressing modality conflicts more effectively, this work could lead to more reliable affective computing systems that better understand human emotions. The main contribution of this paper is the introduction of the Dual-Path Conflict Resolution framework, which innovatively addresses modality conflicts in multimodal emotion recognition by employing a dual-path approach that distinguishes between benign and severe conflicts. This comprehensive analysis highlights the technical contributions, methodological rigor, and potential impact of the research on the field of affective computing.

Analysis: Full Paper • Full text: 48,371 characters

Stage-adaptive audio diffusion modeling

Xuanhao Zhang, Chang Li · arXiv

Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remai...

Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we introduce a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage-aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text-conditioned audio generation and audio-conditioned super-resolution. Across both settings, the proposed stage-aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage-dependent components rather than fixed ingredients.

Institutional Affiliations

Primary: China Pharmaceutical University

All Institutions: China Pharmaceutical University, University of Science and Technology of China

ML Relevance Analysis (82)

The paper presents a novel stage-adaptive framework for audio diffusion modeling, significantly enhancing training efficiency and model performance. The comprehensive methodology and experimental validation contribute valuable insights to the field, although concerns regarding reproducibility and the need for broader applicability remain.

Comprehensive Analysis

Methodology Assessment

The paper introduces a stage-aware perspective on audio diffusion training, which is a significant methodological innovation. The authors propose three complementary mechanisms—decayed SSL guidance, self-adaptive timestep sampling, and structure-aware regularization—each designed to adapt the training process based on the evolving needs of the model. This approach is well-justified and supported by a clear theoretical framework, utilizing a regime variable to monitor semantic progress. The proposed methods are distinct from traditional static optimization techniques, marking a notable advancement in the field of audio diffusion modeling.

Experimental Evaluation

The experiments are comprehensive, evaluating the proposed methods on both text-conditioned audio generation and audio-conditioned super-resolution. The use of multiple metrics (e.g., FAD, KL divergence, and spectral reconstruction metrics) provides a robust assessment of performance improvements. The results consistently demonstrate that the stage-aware mechanisms outperform static baselines, highlighting their effectiveness. However, the paper could benefit from additional experiments to further validate the findings across diverse datasets and conditions.

Reproducibility

The paper lacks explicit details regarding the implementation and availability of code or datasets, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a project URL or demo limits the ability of other researchers to replicate the results or build upon the work.

Limitations

One limitation is the reliance on a single frozen SSL encoder, which may restrict the generalizability of the findings. Additionally, while the results show improvements in convergence and quality metrics, the paper does not sufficiently address the computational overhead introduced by the proposed mechanisms. The authors also acknowledge that gains in certain metrics (e.g., SISNR) were less pronounced, suggesting that the approach may not uniformly enhance all aspects of audio quality.

Broader Impact

The findings have significant implications for the development of efficient audio generation systems, particularly in applications requiring high-quality audio synthesis and restoration. By demonstrating that training efficiency can be improved through a stage-aware approach, this work may influence future research directions in generative modeling and audio processing. The insights gained could also be applicable to other domains where dynamic adaptation of training strategies is beneficial. The paper presents a novel stage-adaptive framework for audio diffusion modeling, significantly enhancing training efficiency and model performance. The comprehensive methodology and experimental validation contribute valuable insights to the field, although concerns regarding reproducibility and the need for broader applicability remain.

Analysis: Full Paper • Full text: 32,063 characters

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

Yukun Chen, Tianrui Wang, Zhaoxi Mu ... · arXiv

High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly n...

High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at https://github.com/pymaster17/VocalParse.

Institutional Affiliations

Primary: Xi'an Jiaotong University

All Institutions: Xi'an Jiaotong University, Nanyang Technological University, Tianjin University, Ant Group, Zhejiang University

GitHub

ML Relevance Analysis (82)

The main contribution of this paper is the development of VocalParse, a unified and scalable singing voice transcription framework that effectively integrates lyrics and melody transcription using advanced prompting strategies and a novel data collection pipeline. This work represents a significant step forward in addressing the challenges of automatic singing voice transcription, with implications for both academic research and practical applications in music technology.

Comprehensive Analysis

Methodology Assessment

The paper introduces VocalParse, a unified singing voice transcription model leveraging a Large Audio Language Model (LALM). The methodology is innovative, particularly with the interleaved prompting formulation that integrates lyrics and melody in a structured manner, addressing the challenges of traditional multi-stage pipelines. The Chain-of-Thought (CoT) prompting strategy is a significant advancement, allowing for better semantic continuity in the transcription process. The introduction of the SingCrawl data pipeline for large-scale data collection is also a noteworthy contribution, enhancing the model's training data quality and quantity.

Experimental Evaluation

The experiments demonstrate VocalParse's state-of-the-art performance across multiple datasets, showcasing its effectiveness in both Automatic Melody Transcription (AMT) and Automatic Lyric Transcription (ALT). The results are robust, with clear metrics provided for evaluation, including Mean Absolute Error (MAE) for melody and Word Error Rate (WER) for lyrics. The ablation studies effectively highlight the importance of the CoT prompting and the SingCrawl pipeline, providing insights into the model's performance drivers.

Reproducibility

The paper provides sufficient implementation details, including training configurations and data processing steps. The availability of source code and checkpoints on GitHub enhances reproducibility, although the lack of a demo or interactive component may limit accessibility for some researchers.

Limitations

The paper acknowledges limitations such as the assumption of a single global tempo for songs, which may not capture variations in performance. Additionally, the model's performance is constrained by the quality of the teacher pipeline used for data annotation. The focus on Mandarin data may also limit generalizability to other languages without further adaptation.

Broader Impact

VocalParse has the potential to significantly impact the field of music information retrieval (MIR) and singing voice synthesis (SVS) by providing a scalable solution for automatic singing voice transcription. This could lead to advancements in music generation, annotation, and analysis, facilitating broader applications in music technology and AI-driven creative processes. The main contribution of this paper is the development of VocalParse, a unified and scalable singing voice transcription framework that effectively integrates lyrics and melody transcription using advanced prompting strategies and a novel data collection pipeline. This work represents a significant step forward in addressing the challenges of automatic singing voice transcription, with implications for both academic research and practical applications in music technology.

Analysis: Full Paper • Full text: 40,665 characters

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Jaavid Aktar Husain, Dorien Herremans · arXiv

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs ...

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

Institutional Affiliations

Primary: Singapore University of Technology and Design

All Institutions: Singapore University of Technology and Design

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of APEX, a multi-task learning framework that predicts both popularity and aesthetic quality in AI-generated music, demonstrating the complementary relationship between these dimensions and providing a foundation for future research in music recommendation systems. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the innovative approach taken and its potential impact on the landscape of music technology.

Comprehensive Analysis

Methodology Assessment

The proposed APEX framework is a significant advancement in the field of music popularity prediction, particularly for AI-generated music. It employs a multi-task learning approach that integrates aesthetic quality dimensions with engagement-based popularity signals, which is a novel combination in this domain. The use of MERT embeddings for audio representation is well-justified, and the systematic exploration of loss strategies and task configurations demonstrates a rigorous approach to model design. The methodology is comprehensive, addressing both the technical aspects of model training and the theoretical underpinnings of the relationship between aesthetic quality and popularity.

Experimental Evaluation

The experiments are robust, utilizing a large-scale dataset of over 211k songs and including a thorough ablation study across 24 experimental conditions. The evaluation on the Music Arena dataset with unseen generative music systems adds significant value, demonstrating the model's generalization capabilities. The results indicate that aesthetic features enhance preference prediction, which is a critical insight for future research in music recommendation systems.

Reproducibility

The paper provides detailed implementation details, including dataset construction, embedding extraction, training procedures, and model architectures. The open-source release of the APEX model and its code on GitHub further supports reproducibility, allowing other researchers to validate and build upon this work.

Limitations

One limitation is the potential bias introduced by the dataset, as it primarily focuses on AI-generated music from specific platforms (Udio and Suno), which may not generalize to all music genres or styles. Additionally, the performance gap on vocal tracks suggests that further refinement is needed for models that incorporate vocal elements, which could be explored in future work.

Broader Impact

The implications of this research are substantial, as it addresses the growing domain of AI-generated music and its integration into popular music consumption. By providing a framework that predicts popularity based on intrinsic audio features and aesthetic quality, it opens avenues for improved music recommendation systems and enhances the understanding of how listeners perceive AI-generated music. This work could influence artists, music platforms, and researchers alike, fostering a deeper appreciation for the aesthetic dimensions of music. The main contribution of this paper is the introduction of APEX, a multi-task learning framework that predicts both popularity and aesthetic quality in AI-generated music, demonstrating the complementary relationship between these dimensions and providing a foundation for future research in music recommendation systems. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the innovative approach taken and its potential impact on the landscape of music technology.

Analysis: Full Paper • Full text: 27,500 characters

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

Ragib Amin Nihal, Benjamin Yen, Runwu Shi ... · arXiv

Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data....

Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. We find that bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09). Their separation aligns closely with spectral distribution distance, a gradient consistent with the acoustic niche hypothesis. This geometry makes simple averaging optimal while sign-conflict methods reduce accuracy by one to six percentage points. Composition also creates an asymmetric gap: species-rich groups lose accuracy relative to joint training while underrepresented taxa gain, a redistribution useful for equitable biodiversity monitoring. We verify linear mode connectivity across all taxonomic pairs, demonstrate zero-shot transfer to new regions, and identify domain negation as a boundary condition where composition fails. These results enable a collaborative paradigm for bioacoustics where institutions share only task vectors to assemble multi-taxa classifiers, preserving data privacy.

Institutional Affiliations

Primary: Institute of Science Tokyo

All Institutions: Institute of Science Tokyo, RIKEN BDR

GitHub

ML Relevance Analysis (83)

This paper presents a significant advancement in the field of bioacoustics by introducing a novel method for composing multi-taxa classifiers using task vector arithmetic, enabling collaborative model building while preserving data privacy. The combination of ecological principles with machine learning techniques offers a fresh perspective on model merging, with the potential to impact biodiversity monitoring and conservation efforts.

Comprehensive Analysis

Methodology Assessment

The methodology presented in this paper is innovative, leveraging task vector arithmetic to combine independently fine-tuned models into a unified classifier without sharing data. The authors provide a clear and systematic approach to the problem, including the definition of task vectors and the exploration of their geometric properties in weight space. The use of ecological principles, specifically the acoustic niche hypothesis, to predict the geometry of task vectors is a novel angle that adds depth to the analysis. The methodology is well-structured, with detailed descriptions of the merging strategies and the assumptions made during the process.

Experimental Evaluation

The experimental evaluation is robust, with multiple experiments validating the proposed approach. The authors demonstrate linear mode connectivity, analyze task vector geometry, and assess the impact of merging on species classification accuracy. The results are comprehensive, showing both the advantages and limitations of the proposed method. The use of various datasets and the exploration of zero-shot transfer capabilities further strengthen the findings. However, the paper could benefit from more extensive comparisons with existing methods to highlight the advantages of their approach.

Reproducibility

The paper provides a thorough account of the experimental setup, including hyperparameters and training protocols, which aids reproducibility. The authors mention using SHA-256 hashes for configuration validation, ensuring that the experiments can be replicated. However, the lack of a detailed description of the datasets used and their preprocessing steps may pose challenges for complete reproducibility.

Limitations

One limitation of the study is the potential for overfitting to the specific datasets used, which may not generalize to other bioacoustic contexts. Additionally, the assumption of disjoint species sets may not hold in all real-world scenarios, potentially affecting the performance of the proposed method. The authors also acknowledge that domain negation fails, indicating that there are boundaries to the applicability of their approach.

Broader Impact

The proposed method has significant implications for biodiversity monitoring and conservation efforts, allowing institutions to collaborate without compromising data privacy. This collaborative paradigm could enhance the development of multi-taxa classifiers, making it easier to monitor diverse ecosystems and respond to conservation needs. The findings could also inspire further research into task vector arithmetic in other domains of machine learning, potentially leading to more efficient and privacy-preserving model training techniques. This paper presents a significant advancement in the field of bioacoustics by introducing a novel method for composing multi-taxa classifiers using task vector arithmetic, enabling collaborative model building while preserving data privacy. The combination of ecological principles with machine learning techniques offers a fresh perspective on model merging, with the potential to impact biodiversity monitoring and conservation efforts.

Analysis: Full Paper • Full text: 43,031 characters

Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

Zijian Zhao, Dian Jin, Zijing Zhou ... · arXiv

Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limita...

Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limitations: the low interpretability of rule-based approaches, the restriction to single-primary-light control in music-to-color-space methods, and the limited transferability of music-to-controlling-parameter frameworks. To address these gaps, we propose SeqLight, a hierarchical deep learning framework that maps music to multi-light Hue-Saturation-Value (HSV) space. Our approach first customizes SkipBART, an end-to-end single primary light generation model, to predict the full light color distribution for each frame, followed by hybrid Imitation Learning (IL) techniques to derive an effective decomposition strategy that distributes the global color distribution among individual lights. Notably, the light decomposition module can be trained under varying venue-specific lighting configurations using only mixed light data and no professional demonstrations, thereby flexibly adapting across diverse venues. In this stage, we formulate the light decomposition task as a Goal-Conditioned Markov Decision Process (GCMDP), construct an expert demonstration set inspired by Hindsight Experience Replay (HER), and introduce a three-phase IL training pipeline, achieving strong generalization capability. To validate our IL solution for the proposed GCMDP, we conduct a series of quantitative analysis and human study. The code and trained models are provided at https://github.com/RS2002/SeqLight .

Institutional Affiliations

Primary: The Hong Kong University of Science and Technology

All Institutions: The Hong Kong Polytechnic University, The University of Hong Kong, City University of Hong Kong

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of SeqLight, a novel hierarchical framework for automatic stage lighting control that effectively combines music analysis with advanced machine learning techniques to enhance the quality and adaptability of lighting in live performances. This work represents a significant advancement in the intersection of machine learning and performing arts, addressing practical challenges while providing a robust methodological framework.

Comprehensive Analysis

Methodology Assessment

The paper introduces SeqLight, a hierarchical deep learning framework that innovatively combines a modified Skip-BART model with imitation learning techniques to address the challenges of automatic stage lighting control. The methodology is well-structured, separating the problem into two stages: predicting light distributions and decomposing these into individual light controls via a Goal-Conditioned Markov Decision Process (GCMDP). The use of hybrid imitation learning, particularly the incorporation of Hindsight Experience Replay (HER) and Adversarial Inverse Reinforcement Learning (AIRL), is a notable strength, as it allows for effective learning without the need for extensive expert demonstrations. The approach is comprehensive, addressing both the technical challenges of multi-light control and the practical limitations of data collection in diverse venues.

Experimental Evaluation

The experiments are robust, utilizing both quantitative metrics (L1 distance, Wasserstein distance, etc.) and qualitative assessments through human evaluations. The results demonstrate that SeqLight outperforms competitive baselines, including Skip-BART and rule-based methods, in various music styles. The inclusion of a human study adds significant value, providing insights into user preferences and the system's generalization capabilities across different music genres. However, the paper could benefit from clearer presentation of experimental setups and results, particularly in terms of dataset descriptions and evaluation metrics.

Reproducibility

The authors provide a GitHub repository with code and trained models, which is a positive aspect for reproducibility. However, the paper lacks detailed descriptions of the datasets used and the specific configurations for training, which could hinder full reproducibility. More explicit guidelines on running the experiments and the environment setup would enhance this aspect.

Limitations

The paper acknowledges the limitations of its approach, particularly regarding the reliance on mixed light data without professional demonstrations. While the method shows promise in diverse venues, the generalization to real-world scenarios may still face challenges due to the variability in lighting setups and music styles. Additionally, the performance in out-of-domain settings, while improved, still shows some degradation compared to in-domain results.

Broader Impact

The proposed method has significant implications for the fields of live music performance and event production, potentially reducing the need for professional lighting engineers and making stage lighting more accessible to amateurs. The ability to adapt to various venues and music styles opens up new avenues for creative expression in live performances. Furthermore, the integration of machine learning in artistic domains like lighting design could inspire further research and applications in related fields. The main contribution of this paper is the introduction of SeqLight, a novel hierarchical framework for automatic stage lighting control that effectively combines music analysis with advanced machine learning techniques to enhance the quality and adaptability of lighting in live performances. This work represents a significant advancement in the intersection of machine learning and performing arts, addressing practical challenges while providing a robust methodological framework.

Analysis: Full Paper • Full text: 50,026 characters

Towards Open World Sound Event Detection

P. H. Hai, L. T. Minh, L. H. Son · arXiv

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world envir...

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.

Institutional Affiliations

Primary: VNU University of Engineering and Technology

All Institutions: VNU University of Engineering and Technology, Artificial Intelligence Research Center, VNU Information Technology Institute

ML Relevance Analysis (82)

The main contribution of this paper is the introduction of the OW-SED paradigm and the WOOT framework, which significantly advances the field of sound event detection by enabling models to detect known events, identify unseen ones, and incrementally learn from them. The methodology and experimental results demonstrate a strong technical contribution that addresses real-world challenges in audio understanding.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel Open-World Sound Event Detection (OW-SED) paradigm, which is a significant shift from traditional closed-world approaches. The proposed 1D Deformable architecture leverages deformable attention mechanisms to focus on salient temporal regions, addressing the unique challenges posed by overlapping and ambiguous sound events. The introduction of the WOOT framework, which incorporates feature disentanglement and a two-stage training strategy, is innovative and effectively enhances the model's ability to generalize to unseen classes while mitigating catastrophic forgetting. The methodology is well-structured, with clear explanations of the architecture and training processes.

Experimental Evaluation

The experimental results demonstrate that the proposed methods outperform existing state-of-the-art techniques in both closed-world and open-world settings. The authors provide comprehensive evaluations using two datasets (URBAN-SED and DESED), showcasing significant improvements in performance metrics such as U-Recall and F1 scores. The experiments are robust, with multiple random seeds used to ensure reliability, and the results are well-presented in comparative tables.

Reproducibility

The paper includes detailed implementation details, including the architecture, training protocols, and hyperparameters. However, the absence of a public code repository or demo URL limits the reproducibility of the results. The authors should consider making their code available to facilitate further research and validation of their findings.

Limitations

One limitation is the reliance on human annotation for labeling unknown events, which may introduce subjectivity and variability in the training process. Additionally, while the model shows strong performance, the paper does not extensively discuss the computational efficiency or scalability of the proposed framework in real-world applications.

Broader Impact

The OW-SED paradigm has significant implications for various applications, including surveillance, smart cities, and healthcare, where the ability to detect and learn from novel sound events in dynamic environments is crucial. This work paves the way for more adaptive audio understanding systems that can continuously evolve with their environments. The main contribution of this paper is the introduction of the OW-SED paradigm and the WOOT framework, which significantly advances the field of sound event detection by enabling models to detect known events, identify unseen ones, and incrementally learn from them. The methodology and experimental results demonstrate a strong technical contribution that addresses real-world challenges in audio understanding.

Analysis: Full Paper • Full text: 50,026 characters

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Jingyao Gong · arXiv

MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-t...

MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker--Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.

Institutional Affiliations

Primary: Independent Researcher

All Institutions: Independent Researcher

Demo · GitHub

ML Relevance Analysis (75)

The main contribution of this paper is the introduction of MiniMind-O, a small-scale omni model that effectively integrates text, speech, and image modalities while maintaining a focus on reproducibility and parameter efficiency. This work represents a meaningful step towards creating accessible and controllable multimodal systems, with implications for various applications in machine learning and human-computer interaction.

Comprehensive Analysis

Methodology Assessment

The methodology employed in MiniMind-O is innovative, leveraging a compact architecture that integrates text, speech, and image modalities within a 0.1B parameter framework. The separation of the Thinker and Talker components allows for a more efficient processing pipeline, while the use of middle-layer semantic bridging and low-rank codebook interfaces enhances parameter efficiency. The decision to release the training datasets and model code further promotes transparency and reproducibility in research.

Experimental Evaluation

The experimental evaluation is robust, with clear metrics for assessing consistency between the Thinker and Talker outputs. The use of Character Error Rate (CER) and voice-cloning similarity scores provides a quantitative basis for evaluating performance. However, the evaluation metrics focus primarily on consistency rather than subjective quality measures, which could limit the understanding of the model's performance in real-world applications.

Reproducibility

The paper emphasizes reproducibility by providing detailed descriptions of the model architecture, training pipeline, and the datasets used. The release of both the model code and training datasets is a significant step towards enabling other researchers to replicate the results. However, the lack of a specific institution may raise questions about the long-term support and maintenance of the project.

Limitations

The main limitations include the model's performance in generating natural-sounding speech, especially for longer responses, which may not match the quality of larger models. Additionally, the visual pathway relies on a frozen encoder, which may not capture the full complexity of visual inputs. The narrow evaluation focus on consistency may overlook other important aspects of model performance, such as user experience and adaptability.

Broader Impact

The potential applications of MiniMind-O are significant, particularly in areas requiring multimodal interaction, such as virtual assistants, educational tools, and accessibility technologies. By providing an open-source framework, the work encourages further research and development in the field of speech-native omni models, potentially leading to advancements in human-computer interaction. The main contribution of this paper is the introduction of MiniMind-O, a small-scale omni model that effectively integrates text, speech, and image modalities while maintaining a focus on reproducibility and parameter efficiency. This work represents a meaningful step towards creating accessible and controllable multimodal systems, with implications for various applications in machine learning and human-computer interaction.

Analysis: Full Paper • Full text: 36,972 characters

Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

Sandra Arcos-Holzinger, Sarah M. Erfani, James Bailey ... · arXiv

Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibilit...

Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibility into local geometric changes. We ask: how do perturbations deform local geometry, and do these shifts track downstream automatic speech recognition (ASR) degradation? To address this, we present GRIDS, a framework using Local Intrinsic Dimensionality (LID) across layer-wise representations in WavLM and wav2vec 2.0. We find that LID increases for all low signal-to noise ratio (SNR) perturbations and diverges at high SNR: benign noise converges toward the clean profile, while adversarial inputs retain early-layer LID elevation. We show LID elevation co-occurs with increased WER, and that layer-wise LID features enable anomaly detection (AUROC 0.78-1.00), opening the door to transcript-free monitoring in S3Ms.

Institutional Affiliations

Primary: University of Melbourne

All Institutions: University of Melbourne, Monash University, Johns Hopkins University

ML Relevance Analysis (83)

The paper presents a comprehensive and innovative framework for analyzing the geometric properties of learned representations in self-supervised speech models, contributing valuable insights into the robustness of these models under various perturbations. The methodology effectively links local geometric changes to performance degradation, marking a significant advancement in the understanding of S3Ms.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel framework, GRIDS, which utilizes Local Intrinsic Dimensionality (LID) to analyze the geometric properties of learned representations in self-supervised speech models (S3Ms) under various perturbations. The methodology is well-structured, employing a layer-wise analysis that captures local changes in representation geometry, which is a significant advancement over traditional global measures. The approach effectively links geometric shifts to downstream performance metrics, specifically word error rate (WER), and provides a robust mechanism for anomaly detection without reliance on ground-truth transcripts. The use of k-nearest neighbors (kNN) for LID estimation is appropriate, although the choice of neighborhood size could introduce variability in results.

Experimental Evaluation

The experiments are comprehensive, utilizing a well-defined dataset (LibriSpeech) and a variety of perturbation types (benign and adversarial) under controlled signal-to-noise ratio (SNR) conditions. The results demonstrate a clear correlation between LID changes and ASR degradation, reinforcing the framework's validity. The performance metrics, including AUROC for anomaly detection, indicate strong results, particularly for the WavLM model. However, the paper could benefit from additional comparative analysis against existing methods to further establish the effectiveness of GRIDS.

Reproducibility

The paper provides detailed descriptions of the experimental setup, including the generation of perturbations and the evaluation protocols. However, the lack of publicly available code or data limits reproducibility. Providing access to the GRIDS framework and datasets would enhance the ability of other researchers to validate and build upon these findings.

Limitations

The study is limited to specific self-supervised speech models (WavLM and wav2vec 2.0) and does not explore the implications of the findings on other architectures or tasks. Additionally, the focus on layer-wise analysis may overlook global representation dynamics that could be informative. The anomaly detection performance decreases at higher SNRs, suggesting that the method may not be universally applicable across all conditions.

Broader Impact

The findings have significant implications for the robustness and interpretability of self-supervised speech models, particularly in real-world applications where adversarial attacks and noise are prevalent. The ability to monitor representation geometry could lead to improved robustness in automatic speech recognition systems and other related fields, such as speaker verification and emotion recognition. The paper presents a comprehensive and innovative framework for analyzing the geometric properties of learned representations in self-supervised speech models, contributing valuable insights into the robustness of these models under various perturbations. The methodology effectively links local geometric changes to performance degradation, marking a significant advancement in the understanding of S3Ms.

Analysis: Full Paper • Full text: 37,134 characters

Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

Jim O'Regan, Jens Edlund · Odyssey 2026

Speech encodes multiple simultaneous attributes--linguistic content, speaker identity, dialect, gender--that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces corresp...

Speech encodes multiple simultaneous attributes--linguistic content, speaker identity, dialect, gender--that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how --or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.

Institutional Affiliations

Primary: Department of Speech, Music & Hearing

All Institutions: Department of Speech, Music & Hearing

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of a novel factor-partitioned embedding framework for speech that allows for controllable multi-axis similarity searches. This work represents a meaningful advancement in the field of speech representation learning, addressing the complexities of speech attributes and providing a robust solution for attribute-conditioned retrieval tasks.

Comprehensive Analysis

Methodology Assessment

The paper introduces a factor-partitioned embedding framework that effectively separates various attributes of speech into distinct subspaces, allowing for nuanced similarity searches. The methodology is well-structured, employing a shared acoustic encoder and per-axis linear projection heads trained through distillation and contrastive objectives. This approach is innovative in its use of signed axis weighting to control retrieval outcomes, providing a significant advancement over traditional single-vector embeddings that conflate multiple attributes.

Experimental Evaluation

The experiments conducted are thorough, utilizing cross-corpus retrieval tasks with well-defined metrics such as Precision@k and preference-flip tests. The evaluation on datasets sharing the Harvard sentence prompts is appropriate, demonstrating the framework's ability to suppress same-speaker bias and surface semantically matched utterances. The results are compelling, showing that the proposed method outperforms baseline models significantly.

Reproducibility

The paper provides a clear description of the architecture, data sources, and training processes, which enhances reproducibility. However, the absence of a publicly available implementation or code repository limits the ability for other researchers to replicate the results fully.

Limitations

One limitation identified is the reliance on specific datasets, which may not generalize well to other speech domains. Additionally, the paper notes challenges in separating the gender axis due to its correlation with speaker identity, suggesting that further exploration is needed to improve this aspect. The potential for mode collapse in models trained without auxiliary tasks is also a concern.

Broader Impact

The proposed framework has significant implications for applications in speech retrieval, speaker recognition, and potentially in areas such as voice conversion and personalized speech synthesis. By enabling controllable retrieval based on multiple axes, it opens avenues for more sophisticated user interactions with speech technologies. The main contribution of this paper is the introduction of a novel factor-partitioned embedding framework for speech that allows for controllable multi-axis similarity searches. This work represents a meaningful advancement in the field of speech representation learning, addressing the complexities of speech attributes and providing a robust solution for attribute-conditioned retrieval tasks.

Analysis: Full Paper • Full text: 30,104 characters

Neck-Learn: Attention-Based Multiple Instance Learning and Ensemble Framework for Ecological Momentary Assessment

Ahsan Jamal Cheema · arXiv

Vocal hyperfunction (VH) is a prevalent voice disorder whose ambulatory detection remains challenging despite extensive daily voice data. Prior approaches capture week-long neck-surface accelerometer recordings but collapse them into fixed-length subject-level feature vectors, di...

Vocal hyperfunction (VH) is a prevalent voice disorder whose ambulatory detection remains challenging despite extensive daily voice data. Prior approaches capture week-long neck-surface accelerometer recordings but collapse them into fixed-length subject-level feature vectors, discarding within-day temporal dynamics encoding nuanced voicing feature interactions. We introduce a novel hybrid architecture combining gradient-boosted trees on day-level distributional features with a CNN-based multiple instance learning (MIL) framework that preserves and learns from from temporal dynamics throughout each day. On the held-out test set, our model exceeds the challenge baselines (AUC: 0.82 PVH, 0.77 NPVH), achieving AUCs of 0.879 for PVH (Rank 5) and 0.848 for NPVH (Rank 3), while also providing insights into clinically relevant information about both pathologies.

Institutional Affiliations

Primary: Harvard University

All Institutions: Harvard University, Eaton Peabody Laboratories, Massachusetts Eye and Ear Infirmary

ML Relevance Analysis (83)

The paper presents a novel hybrid architecture for detecting vocal hyperfunction using attention-based multiple instance learning, significantly advancing the state of the art in ambulatory voice monitoring. The methodology effectively addresses the limitations of previous approaches by preserving temporal dynamics, leading to improved diagnostic accuracy and insights into voice disorders.

Comprehensive Analysis

Methodology Assessment

The paper introduces a hybrid architecture that effectively combines gradient-boosted trees with a CNN-based multiple instance learning (MIL) framework, which is innovative in the context of ecological momentary assessment for voice disorders. The dual-representation framework allows the model to leverage both global distributional features and local temporal dynamics, addressing the shortcomings of previous methods that relied on fixed-length feature vectors. The attention mechanism in the CNN-MIL architecture is particularly noteworthy, as it enables the model to learn which time segments are most discriminative for vocal hyperfunction detection.

Experimental Evaluation

The experiments are robust, utilizing the NeckVibe Challenge dataset, which is the largest publicly available dataset for this task. The reported AUC scores demonstrate significant improvements over existing baselines, particularly for the more challenging non-structural vocal hyperfunction (NPVH) classification. The systematic ablation studies provide strong evidence for the contributions of each model component, confirming the effectiveness of the proposed ensemble strategy.

Reproducibility

The paper provides detailed implementation information, including hyperparameters and software versions used, which enhances reproducibility. However, the absence of a publicly accessible code repository limits the ability for other researchers to replicate the results fully.

Limitations

The study acknowledges that the CNN-MIL framework processes each day independently, which may overlook week-level trends in vocal behavior. Additionally, while the attention mechanism offers some interpretability, the complexity of the model may hinder full understanding of the feature interactions. Future work could explore hierarchical models and causal inference approaches.

Broader Impact

The findings have significant implications for the field of voice disorder diagnostics, particularly in enhancing ambulatory monitoring techniques. By capturing temporal dynamics, the proposed approach could lead to more accurate and timely interventions for individuals suffering from vocal hyperfunction, potentially improving patient outcomes. The paper presents a novel hybrid architecture for detecting vocal hyperfunction using attention-based multiple instance learning, significantly advancing the state of the art in ambulatory voice monitoring. The methodology effectively addresses the limitations of previous approaches by preserving temporal dynamics, leading to improved diagnostic accuracy and insights into voice disorders.

Analysis: Full Paper • Full text: 23,143 characters

Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings

Vamshi Nallaguntla, Shruti Kshirsagar, Anderson R. Avila · arXiv

Recent advances in emotional voice conversion (EVC) have enabled the generation of expressive synthetic speech, raising new concerns in audio deepfake detection. Existing approaches treat speech as a homogeneous signal and largely overlook its internal phonetic structure, limitin...

Recent advances in emotional voice conversion (EVC) have enabled the generation of expressive synthetic speech, raising new concerns in audio deepfake detection. Existing approaches treat speech as a homogeneous signal and largely overlook its internal phonetic structure, limiting their interpretability in emotionally conditioned settings. In this work, we propose a phoneme-level framework to analyze emotionally manipulated synthetic speech using real and EVC-generated speech under matched emotional conditions with shared transcripts, phoneme-aligned TextGrids, and WavLM-based embeddings. Our results show that phoneme behavior varies across categories, with complex vowels and fricatives exhibiting higher divergence while simpler phonemes remain more stable. Phonemes with larger distributional differences are also found to be more easily detected, consistently across multiple emotions and synthesis systems. These findings demonstrate that phoneme-level analysis is an effective and interpretable approach for detecting emotionally manipulated synthetic speech.

Institutional Affiliations

Primary: Wichita State University

All Institutions: Wichita State University, Institut national de la recherche scientifique (INRS--EMT), INRS-UQO Mixed Research Unit on Cybersecurity

ML Relevance Analysis (83)

This paper presents a novel phoneme-level framework for detecting emotionally manipulated synthetic speech, demonstrating that phoneme behavior varies significantly across categories and can be effectively analyzed for deepfake detection. The methodology is sound, the experimental design is robust, and the results contribute valuable insights to the field of audio deepfake detection.

Comprehensive Analysis

Methodology Assessment

The proposed phoneme-level framework is innovative, addressing the limitations of existing deepfake detection methods that overlook phonetic structures in emotionally manipulated speech. The use of WavLM embeddings and Kullback-Leibler divergence for phoneme-level analysis is a strong methodological choice, allowing for nuanced insights into how emotional voice conversion affects phoneme characteristics. The controlled dataset and phoneme-aligned TextGrids enhance the robustness of the analysis.

Experimental Evaluation

The experiments are well-structured, utilizing a curated dataset that allows for direct comparisons between real and synthetic speech across multiple emotions. The results demonstrate clear patterns in phoneme sensitivity to emotional manipulation, with statistical analyses supporting the findings. The use of SVM for classification and the correlation analysis between KLD and classification accuracy are appropriate and yield meaningful insights.

Reproducibility

The paper mentions the release of a curated dataset with aligned transcripts and phoneme-level annotations, which supports reproducibility. However, specific implementation details regarding the SVM training and evaluation metrics could be more explicitly stated to facilitate replication.

Limitations

One limitation is the focus on only two EVC systems and a limited number of speakers, which may restrict the generalizability of the findings. Additionally, the paper does not explore the potential impact of background noise or other real-world conditions on detection performance.

Broader Impact

The findings have significant implications for audio deepfake detection, particularly in high-stakes environments such as journalism and legal proceedings. By providing a phoneme-level analysis, the work contributes to the development of more interpretable and robust detection systems, potentially enhancing trust in audio communications. This paper presents a novel phoneme-level framework for detecting emotionally manipulated synthetic speech, demonstrating that phoneme behavior varies significantly across categories and can be effectively analyzed for deepfake detection. The methodology is sound, the experimental design is robust, and the results contribute valuable insights to the field of audio deepfake detection.

Analysis: Full Paper • Full text: 29,558 characters

Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

Jiaxu He, Chao Wang, Jie Lian ... · arXiv

Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-mod...

Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-model-based Tibetan TTS system in the industry, built upon a large speech synthesis model developed by Xingchen AGI Lab. The proposed system integrates data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource Tibetan speech synthesis. Experimental results show that the system can generate stable, natural, and intelligible Tibetan speech under low-resource conditions. In subjective evaluation, the MOS scores of the syllable-level and BPE-based systems reach 4.28 and 4.35, while their pronunciation accuracies reach 97.6% and 96.6%, respectively, outperforming an external commercial Tibetan TTS interface. These results demonstrate that combining a large-model backbone with Tibetan-oriented text representation adaptation and cross-lingual adaptive training enables highly usable low-resource Tibetan speech synthesis, and also provides a technical foundation for future unified multi-dialect Tibetan speech synthesis.

Institutional Affiliations

Primary: Xingchen AGI Lab

All Institutions: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology Co. Ltd, Xizang University, Qinghai Normal University, University of Electronic Science and Technology of China

ML Relevance Analysis (83)

This paper establishes a pioneering framework for low-resource Tibetan speech synthesis, combining innovative methodologies with practical applications. The integration of data quality enhancement, tailored text representation, and cross-lingual training marks a significant advancement in the field of speech synthesis, particularly for minority languages facing resource constraints.

Comprehensive Analysis

Methodology Assessment

The paper presents a comprehensive approach to Tibetan TTS by integrating a data quality enhancement pipeline, a Tibetan-oriented text representation, and a cross-lingual adaptive training strategy. The methodology is well-structured, addressing the unique challenges of Tibetan speech synthesis, such as dialectal variation and resource scarcity. The use of a large-model backbone and the adaptation of tokenization strategies specifically for Tibetan linguistic characteristics are notable innovations that enhance the model's performance under low-resource conditions.

Experimental Evaluation

The experiments are robust, with both subjective (MOS scores) and objective (pronunciation accuracy) evaluations demonstrating the effectiveness of the proposed system. The reported MOS scores of 4.28 and 4.35 for the syllable-level and BPE-based systems, respectively, indicate a high level of naturalness and intelligibility. The comparison with an external commercial TTS interface further validates the system's performance, showcasing its potential for practical applications.

Reproducibility

The paper lacks detailed implementation specifics, such as code availability or dataset access, which could hinder reproducibility. While the methodology is clearly described, the absence of a public repository or demo limits the ability for other researchers to replicate the results.

Limitations

The study primarily focuses on the U-Tsang dialect, which may limit the generalizability of the findings to other Tibetan dialects. Additionally, while the proposed methods show promise, the reliance on a large pretrained model may not be feasible in all low-resource scenarios, particularly where such models are not available.

Broader Impact

The development of a Tibetan TTS system has significant implications for cultural preservation, education, and accessibility in Tibetan-speaking regions. By providing a framework for low-resource language synthesis, this work could serve as a model for similar efforts in other underrepresented languages, promoting linguistic diversity in technology. This paper establishes a pioneering framework for low-resource Tibetan speech synthesis, combining innovative methodologies with practical applications. The integration of data quality enhancement, tailored text representation, and cross-lingual training marks a significant advancement in the field of speech synthesis, particularly for minority languages facing resource constraints.

Analysis: Full Paper • Full text: 29,223 characters

Period-conscious Time-series Reconstruction under Local Differential Privacy

Yaxuan Wang, Tianxin Li, Enji Liang ... · ICME 2026 Workshop

Periodic patterns are fundamental cues in multimedia signals and systems, including repetitive motion in video (e.g., gait cycles), rhythmic and pitch-related structure in audio, and recurring textures in image sequences. When such user-generated streams are collected from edge d...

Periodic patterns are fundamental cues in multimedia signals and systems, including repetitive motion in video (e.g., gait cycles), rhythmic and pitch-related structure in audio, and recurring textures in image sequences. When such user-generated streams are collected from edge devices, local differential privacy (LDP) is appealing because it perturbs data before upload; however, the injected noise can corrupt spectral peaks and induce phase drift, making period estimation unreliable and degrading reconstruction quality. We propose \textbf{CPR} (\textit{Cycle and Phase Recovery}), a period-aware reconstruction framework for periodic time series under LDP. CPR performs multi-scale period probing and multi-consensus selection to suppress noise-induced spectral interference, then aggregates perturbed samples at matched within-cycle phase positions to stabilize phase alignment across cycles. To recover the underlying per-phase values, CPR combines EM-based denoising with kernel density estimation, improving robustness under tight privacy budgets. Experiments on two real-world periodic datasets demonstrate that CPR better preserves periodic structure and consistently achieves lower reconstruction error than representative LDP baselines, especially in the low-$ε$ regime.

Institutional Affiliations

Primary: Taiyuan University of Technology

All Institutions: Taiyuan University of Technology, University of Michigan, Ann Arbor

ML Relevance Analysis (82)

The main contribution of this paper is the development of the CPR framework, which effectively addresses the challenges of reconstructing periodic time series under local differential privacy. This work significantly advances the field by providing a robust solution that preserves critical periodic structures while ensuring privacy, thus opening new avenues for research and application in multimedia signal processing.

Comprehensive Analysis

Methodology Assessment

The proposed CPR framework introduces a novel approach to reconstruct periodic time series under local differential privacy (LDP) by addressing the specific challenges posed by noise-induced spectral interference and phase drift. The methodology is robust, employing multi-scale period probing and phase-aware aggregation, which are well-justified in the context of multimedia signals. The integration of EM-based denoising with kernel density estimation is a significant technical contribution, enhancing the reconstruction quality while maintaining privacy. The paper effectively articulates the rationale behind each methodological choice and demonstrates a clear understanding of the underlying challenges.

Experimental Evaluation

The experiments conducted on real-world datasets are comprehensive and well-structured, showcasing the effectiveness of CPR compared to several baseline methods. The evaluation metrics, particularly the cosine distance for reconstruction accuracy, are appropriate for the task. The results convincingly demonstrate that CPR outperforms existing methods, especially under tight privacy constraints, which is crucial for practical applications. However, the paper could benefit from a more detailed discussion of the experimental setup, including hyperparameter choices and potential variations in results across different datasets.

Reproducibility

The paper provides a clear description of the experimental setup and methodologies used, but it lacks specific implementation details that would facilitate reproducibility. Key parameters and configurations for the experiments are mentioned, yet the absence of a publicly available code repository limits the ability for others to replicate the results. Including a link to a GitHub repository or similar would greatly enhance reproducibility.

Limitations

One limitation of the study is the reliance on specific datasets that may not fully represent the diversity of periodic signals encountered in real-world applications. Additionally, while the proposed method shows significant improvements, the paper does not address the computational complexity of the CPR framework, which may affect its applicability in resource-constrained environments. The potential trade-offs between privacy and reconstruction accuracy under varying conditions could also be explored further.

Broader Impact

The implications of this research are significant, particularly in fields where privacy-preserving data collection is paramount, such as healthcare and personal monitoring systems. By enabling accurate reconstruction of periodic signals while adhering to strict privacy constraints, this work has the potential to enhance the usability of sensitive data in various applications, including motion analysis and behavioral monitoring. The findings may also inspire further research into privacy-aware signal processing techniques. The main contribution of this paper is the development of the CPR framework, which effectively addresses the challenges of reconstructing periodic time series under local differential privacy. This work significantly advances the field by providing a robust solution that preserves critical periodic structures while ensuring privacy, thus opening new avenues for research and application in multimedia signal processing.

Analysis: Full Paper • Full text: 24,206 characters

Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

Yadi Wen, Tianxin Li, Enji Liang ... · ICME 2026 Workshop

We study example-level private supervised speech classification under a practical release constraint: training may access privileged side information, but the released model must be audio-only. This setting is important because speech systems can often exploit richer side informa...

We study example-level private supervised speech classification under a practical release constraint: training may access privileged side information, but the released model must be audio-only. This setting is important because speech systems can often exploit richer side information during development, whereas deployment and release require a lightweight unimodal model with auditable privacy guarantees. Using DP-SGD on the private dataset $D_{\text{priv}}$, we identify a strong-privacy failure mode ($ε\le 1$) on imbalanced tasks, where training may collapse to a near single-class predictor, a phenomenon that overall accuracy can obscure. We therefore emphasize Macro-F1, balanced accuracy, and a simple collapse diagnostic. This failure is especially problematic in our release setting because a collapsed private teacher cannot provide useful supervision for the downstream audio-only student. To address this setting under strong privacy, we propose a two-stage protocol: (i) train a (possibly multimodal) DP teacher on $D_{\text{priv}}$, and (ii) distill an audio-only student on a fixed, recording-disjoint auxiliary dataset $D_{\text{aux}}$ using one-shot offline teacher probability outputs, releasing only the student. The DP guarantee applies only to $D_{\text{priv}}$; we make no DP claim for $D_{\text{aux}}$, and privacy of the released student with respect to $D_{\text{priv}}$ follows by post-processing. We frame this setting as involving four coupled bottlenecks: speech-induced optimization instability under DP-SGD, minority-class erosion under clipping and noise, teacher over-reliance on privileged modalities unavailable at deployment, and train--deploy modality mismatch. We address them with a DP-stabilizing acoustic front-end (DSAF), minibatch-adaptive bounded loss reweighting (AW-DP), privileged-modality dropout, and offline teacher-to-student distillation.

Institutional Affiliations

Primary: Taiyuan University of Technology

All Institutions: Shanxi Key Laboratory of Industrial Internet Security, University of Michigan, Taiyuan University of Technology

ML Relevance Analysis (82)

The main contribution of this paper is a novel two-stage protocol for private speech classification that mitigates prediction collapse and class imbalance while ensuring differential privacy. This work significantly advances the field of privacy-preserving machine learning by providing effective solutions to critical challenges in deploying speech classification systems.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel two-stage protocol for private speech classification that effectively addresses the challenges of differential privacy (DP) in imbalanced datasets. The methodology is well-structured, incorporating a DP teacher trained on private data followed by offline distillation to create an audio-only student model. The authors introduce several innovative techniques such as the DP-Stabilizing Acoustic Front-End (DSAF) and Imbalance-aware Weighted DP-SGD (AW-DP) to mitigate issues related to prediction collapse and class imbalance. The use of privileged-modality dropout further enhances the robustness of the model by discouraging reliance on privileged information during deployment. Overall, the methodology is comprehensive and addresses critical bottlenecks in private speech classification.

Experimental Evaluation

The experiments are thorough, utilizing the Mozilla Common Voice dataset to evaluate the proposed methods. The authors provide a clear comparison of their two-stage distillation approach against a single-stage DP audio-only model. The results demonstrate significant improvements in Macro-F1 and balanced accuracy metrics, particularly under strong privacy constraints, highlighting the effectiveness of their approach. The inclusion of various metrics to diagnose collapse, such as Maj-Pred, adds depth to the evaluation. However, further exploration of the impact of different auxiliary dataset sizes could enhance the robustness of the findings.

Reproducibility

The paper provides sufficient details regarding the experimental setup, including the use of Python and PyTorch for implementation. However, the absence of a public code repository or demo URL limits reproducibility. Future work should consider making the code available to facilitate validation of results and encourage further research in this area.

Limitations

One limitation of the study is the lack of a comprehensive evaluation of the model's performance across diverse datasets beyond the Mozilla Common Voice dataset. Additionally, while the paper addresses the issue of prediction collapse, it does not fully explore the implications of using auxiliary datasets that may not be representative of the deployment environment. The privacy guarantee for the auxiliary dataset is also not claimed, which could raise concerns about potential data leakage.

Broader Impact

The proposed methods have significant implications for the deployment of speech classification systems in privacy-sensitive applications, such as voice assistants and transcription services. By ensuring that models can be trained with rich side information while maintaining privacy in the released artifacts, this work contributes to the development of more secure and trustworthy AI systems. The approach could be extended to other modalities and applications, enhancing the overall impact on the field of machine learning and privacy-preserving technologies. The main contribution of this paper is a novel two-stage protocol for private speech classification that mitigates prediction collapse and class imbalance while ensuring differential privacy. This work significantly advances the field of privacy-preserving machine learning by providing effective solutions to critical challenges in deploying speech classification systems.

Analysis: Full Paper • Full text: 25,194 characters

The Streaming Reservoir Convergence Theorem: A Prospect-Theoretic Framework for Multi-Provider Adaptive Streaming

Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum ... · arXiv

We present the Streaming Reservoir Convergence Theorem (SRCT), a novel mathematical framework for multi-provider adaptive bitrate streaming that addresses three fundamental structural weaknesses in current systems: linear provider probing, reactive failover, and cold standby tran...

We present the Streaming Reservoir Convergence Theorem (SRCT), a novel mathematical framework for multi-provider adaptive bitrate streaming that addresses three fundamental structural weaknesses in current systems: linear provider probing, reactive failover, and cold standby transitions. SRCT models stream acquisition as a concurrent reservoir filling problem$-$probing all $N$ providers simultaneously rather than in batches$-$and maintains $k$ pre-verified, pre-fetched standby streams alongside the active stream to enable sub-second failover with zero user-visible disruption. We prove four principal results: (1) a harmonic lower bound on reservoir safety showing that $k$ independent streams provide $H_k / \barλ$ expected uptime where $H_k$ is the $k$-th harmonic number; (2) a concurrent acquisition speedup $S(N,b) = (N/b) \cdot (1-F^b)/(1-F^N)$ over batched probing, yielding $3$-$5\times$ practical improvement; (3) monotonic non-decreasing quality under lazy-refill with convergence to the Pareto-optimal frontier; and (4) a prospect-weighted switching rule$-$using Kahneman-Tversky value functions with $α=β=0.88$, $λ=2.25$ $-$ that provably eliminates thrashing between similar-quality streams via a no-thrash bound on the expected switch count. We implement SRCT across two production streaming pipelines: a primary movie/TV system serving 12+ HLS providers with $k=3$ reservoir slots, and a live sports system with multi-format DASH/HLS failover. Empirical verification via Monte Carlo simulation (5000 trials) confirms all four theorems across 22 independent checks. The reservoir of $k=3$ streams achieves $9.15\times$ mean time to depletion versus a single stream, and concurrent probing of 12 providers at 40% failure rate yields a $4.27\times$ speedup over the current batched-by-3 default.

Institutional Affiliations

Primary: Sperix Labs

All Institutions: Sperix Labs, KNUST

ML Relevance Analysis (81)

The main contribution of this paper is the introduction of the Streaming Reservoir Convergence Theorem, which provides a comprehensive framework for improving multi-provider adaptive streaming. The technical contributions, including theoretical proofs and empirical validation, position this work as a notable advancement in the field of audio streaming and adaptive bitrate technologies.

Comprehensive Analysis

Methodology Assessment

The paper introduces the Streaming Reservoir Convergence Theorem (SRCT), which presents a novel mathematical framework for adaptive bitrate streaming across multiple providers. The methodology is robust, employing a combination of theoretical proofs and practical implementations. The authors effectively unify several aspects of streaming (provider probing, failover, and quality selection) into a single reservoir model, which is a significant advancement over traditional methods. The use of prospect theory to inform the switching rules adds a unique psychological dimension to the algorithm, enhancing its applicability in real-world scenarios.

Experimental Evaluation

The experimental evaluation is thorough, with empirical verification conducted through Monte Carlo simulations and deterministic checks. The results demonstrate significant improvements in mean time to depletion and speedup in stream acquisition, validating the theoretical claims made in the paper. The experiments are well-designed, with a clear connection between the theoretical framework and practical outcomes.

Reproducibility

While the paper provides a detailed description of the algorithm and its implementation, it lacks a publicly accessible code repository or demo URL, which would enhance reproducibility. The absence of shared code limits the ability of other researchers to validate the findings independently.

Limitations

The paper acknowledges the limitations of the conditional independence assumption, which may not hold during large-scale outages. Additionally, the reliance on a Markov model for stream viability may not capture more complex, time-varying availability patterns. The parameters derived from prospect theory are also noted as potentially needing further calibration for specific streaming contexts.

Broader Impact

The proposed framework has significant implications for the future of adaptive streaming technologies, particularly in enhancing user experience through reduced buffering and improved quality. The integration of psychological principles into automated decision-making processes may influence other domains beyond streaming, such as network management and real-time systems. The main contribution of this paper is the introduction of the Streaming Reservoir Convergence Theorem, which provides a comprehensive framework for improving multi-provider adaptive streaming. The technical contributions, including theoretical proofs and empirical validation, position this work as a notable advancement in the field of audio streaming and adaptive bitrate technologies.

Analysis: Full Paper • Full text: 34,396 characters

Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization

Tung Vu, Yen Nguyen, Hai Nguyen ... · arXiv

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfa...

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.

Institutional Affiliations

Primary: Posts and Telecommunications Institute of Technology

All Institutions: Posts and Telecommunications Institute of Technology

GitHub

ML Relevance Analysis (81)

The paper presents a comprehensive approach to addressing the challenges of multi-region speech inpainting detection, contributing valuable resources and methodologies to the field of audio forensics. The introduction of the MIST dataset and the ISA framework represents a meaningful step forward in the ongoing battle against audio deepfakes and misinformation.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel dataset (MIST) designed specifically for multi-region speech inpainting detection, addressing significant gaps in existing benchmarks that focus primarily on single-region tampering. The methodology includes a comprehensive generation pipeline for creating the dataset, which utilizes LLM-guided semantic replacements and neural voice cloning to produce high-quality tampered audio. The proposed Iterative Segment Analysis (ISA) framework is robust and backbone-agnostic, allowing for effective localization of tampered segments without prior knowledge of their count. The introduction of the SF1@tau metric is a significant advancement, providing a more nuanced evaluation of detection performance by accounting for both segment count and localization precision.

Experimental Evaluation

The experiments conducted are thorough, employing a zero-shot evaluation strategy that highlights the challenges of detecting partial inpainting. The results demonstrate that ISA outperforms baseline methods in terms of localization accuracy, even when using a classifier not specifically trained on the dataset. The paper includes a detailed breakdown of results by language and variant, providing insights into the performance across different linguistic contexts and manipulation types. However, the overall SF1 scores remain low, indicating that the task is inherently challenging and that further work is needed to improve detection capabilities.

Reproducibility

The authors provide sufficient details regarding the dataset generation process, the ISA framework, and the evaluation metrics, which supports reproducibility. The dataset and code are publicly available, which is a positive aspect for the research community. However, the implementation details of the ISA framework could benefit from additional clarity regarding hyperparameter tuning and specific configurations used in experiments.

Limitations

The paper acknowledges the limitations of existing audio deepfake detection systems, particularly their inability to handle partial inpainting effectively. The ISA method, while innovative, still relies on a backbone classifier that was not trained on the specific task of multi-region tampering, which may limit its performance. Additionally, the low absolute scores in the experiments suggest that the problem remains challenging, and the proposed methods may require further refinement and optimization.

Broader Impact

The implications of this research are significant, particularly in the context of audio forensics and misinformation detection. As voice cloning technology advances, the ability to detect and localize tampered speech becomes increasingly critical for maintaining trust in audio communications. The MIST dataset and ISA framework can serve as foundational tools for future research in this area, potentially leading to improved detection methods and better understanding of audio manipulation threats. The paper presents a comprehensive approach to addressing the challenges of multi-region speech inpainting detection, contributing valuable resources and methodologies to the field of audio forensics. The introduction of the MIST dataset and the ISA framework represents a meaningful step forward in the ongoing battle against audio deepfakes and misinformation.

Analysis: Full Paper • Full text: 50,026 characters

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Xinmeng Xu, Haoran Xie, S. Joe Qin ... · arXiv

Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also ...

Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also needs sufficient cross-layer and cross-modal support before it can reliably guide later fusion. This paper studies this issue through propagation-aware representation readiness and formulates premature perceptual commitment as a readiness-deficiency problem, where local plausibility, propagation influence, and support insufficiency jointly appear at an intermediate stage. We propose the Delayed Perceptual Commitment Network (DPC-Net), an encoder-level framework that estimates an observable readiness-deficiency surrogate, localizes the intervention-sensitive bottleneck, and applies support-aware correction with cross-layer and cross-modal evidence. DPC-Net preserves task-specific heads, losses, decoding modules, and evaluation protocols, making it applicable to different audio-visual tasks through encoder-side intervention. Experiments on audio-visual speech separation, audio-visual event localization, and audio-visual speech recognition show consistent improvements across reconstruction, localization, and recognition regimes. Further analyses on component contribution, selection criteria, counterfactual intervention, and readiness trajectories support the effectiveness of readiness-guided bottleneck correction.

Institutional Affiliations

Primary: Lingnan University

All Institutions: Lingnan University, University of Southern Queensland, Wuhan University of Technology, Hong Kong Metropolitan University

ML Relevance Analysis (83)

The main contribution of this work is the introduction of DPC-Net, a novel framework that enhances representation readiness in audio-visual learning by addressing premature perceptual commitment through a readiness-deficiency approach. This research significantly advances the understanding of audio-visual representation learning, providing a robust mechanism to improve performance across various tasks while preserving the integrity of task-specific architectures.

Comprehensive Analysis

Methodology Assessment

The paper introduces the Delayed Perceptual Commitment Network (DPC-Net), which innovatively addresses the issue of representation readiness in stage-wise audio-visual learning by formulating premature perceptual commitment as a readiness-deficiency problem. The methodology is well-structured, utilizing a readiness-deficiency surrogate to localize bottlenecks and applying support-aware corrections, thus enhancing the robustness of audio-visual representations. The approach is grounded in theoretical insights from human perception, which strengthens its conceptual foundation.

Experimental Evaluation

The experiments are comprehensive, covering three distinct audio-visual tasks: speech separation, event localization, and speech recognition. The results demonstrate consistent improvements across various metrics, indicating the effectiveness of the proposed method. The use of controlled comparisons with baseline models adds rigor to the evaluation, although the paper could benefit from additional qualitative assessments of the generated outputs.

Reproducibility

The paper outlines the implementation details, including the architecture and training protocols, which aids reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the findings fully.

Limitations

One limitation is the reliance on specific datasets for evaluation, which may not generalize across all audio-visual tasks. Additionally, the paper does not address potential computational overhead introduced by the DPC-Net framework, which could impact its practical deployment in real-time systems.

Broader Impact

The proposed framework has significant implications for improving audio-visual learning systems, particularly in applications requiring robust performance under adverse conditions, such as speech recognition in noisy environments. The insights gained from this research could lead to advancements in multimodal AI systems, enhancing their ability to process and integrate diverse sensory inputs effectively. The main contribution of this work is the introduction of DPC-Net, a novel framework that enhances representation readiness in audio-visual learning by addressing premature perceptual commitment through a readiness-deficiency approach. This research significantly advances the understanding of audio-visual representation learning, providing a robust mechanism to improve performance across various tasks while preserving the integrity of task-specific architectures.

Analysis: Full Paper • Full text: 49,165 characters

Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

Jiafeng Liu, Yuanliang Dong, Hongjia Liu ... · arXiv

A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we ex...

A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we explore an alternative view: both may be progressively modeled within a single deep acoustic-token hierarchy. To study this, we build a 64-layer residual vector quantization (RVQ) acoustic representation and propose a two-stage coarse-to-fine generation framework. A backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then completes finer tokens within the same acoustic token space. The super-resolution stage works at full-track scale and refines tokens layer by layer while running in parallel over time, leading to a fixed 62-step inference process. To jointly improve lyric alignment and fine-detail reconstruction, we further introduce hybrid-attention training: the alignment objective uses causal attention, while layer-wise refinement uses full attention. A key finding is that text--vocal alignment can emerge within pure acoustic-token language modeling, without requiring a separate semantic token stage. Moreover, initializing the super-resolution model from the trained backbone significantly improves convergence and final quality. Taken together, our results suggest that high-quality music generation can be effectively pursued without separating structure and fidelity into heterogeneous representation spaces. Instead, both can be progressively modeled within a unified acoustic-token hierarchy, pointing toward a simpler and more unified path to high-quality music generation.

Institutional Affiliations

Primary: Central Conservatory of Music

All Institutions: Central Conservatory of Music, Tsinghua University

Demo · GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of Khala, a high-fidelity music generation system that effectively models both musical structure and acoustic fidelity within a unified acoustic-token hierarchy, demonstrating competitive performance against existing systems. This work significantly advances the field of music generation by providing a novel methodology that integrates lyric alignment and acoustic detail refinement in a single framework, showcasing the potential for future developments in this area.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel approach to music generation by using a unified acoustic-token hierarchy, which contrasts with existing methods that typically separate structure and fidelity into different stages. The introduction of a two-stage coarse-to-fine generation framework and hybrid-attention training for lyric alignment is particularly innovative. The methodology is well-structured, with clear explanations of the architecture and training processes, including the use of a 64-layer residual vector quantization (RVQ) acoustic representation. The authors effectively address the challenges of generating coherent and high-fidelity music while maintaining lyric alignment, which is a significant advancement in the field.

Experimental Evaluation

The experiments conducted in the paper include a large-scale human preference evaluation, which is crucial for assessing the quality of generated music. The use of both subjective (mean Overall Score and BT-derived Elo) and objective metrics provides a comprehensive evaluation framework. The results indicate that Khala performs competitively against both commercial and open-source systems, showcasing its effectiveness in real-world applications. The ablation studies further validate the importance of the proposed training strategies, particularly the backbone initialization and hybrid-attention training.

Reproducibility

The paper provides detailed implementation information, including model architectures, training strategies, and dataset descriptions. The availability of code and model checkpoints on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work. However, the paper could benefit from more explicit details on hyperparameter settings and training configurations to facilitate easier reproduction.

Limitations

One limitation noted is the reliance on a two-stage model design, which, while practical, may not fully exploit the potential of a unified model that integrates both coarse generation and fine-layer refinement. Additionally, the paper acknowledges that the current tokenizer, while effective, could be improved for even higher fidelity, suggesting that future work could focus on enhancing the acoustic representation further.

Broader Impact

The work has significant implications for the field of music generation, particularly in developing systems that can produce high-quality music with coherent structure and fidelity without relying on separate semantic stages. This approach could pave the way for more integrated and efficient music generation systems, potentially impacting various applications in entertainment, education, and creative industries. The findings also suggest a promising direction for future research in audio modeling and machine learning. The main contribution of this paper is the introduction of Khala, a high-fidelity music generation system that effectively models both musical structure and acoustic fidelity within a unified acoustic-token hierarchy, demonstrating competitive performance against existing systems. This work significantly advances the field of music generation by providing a novel methodology that integrates lyric alignment and acoustic detail refinement in a single framework, showcasing the potential for future developments in this area.

Analysis: Full Paper • Full text: 37,110 characters

Audio ML Papers

🏆 Top Papers This Week

Institutional Affiliations

ML Relevance Analysis (93)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (91)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (91)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (82)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (86)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility