While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language: unified omnimodal benchmarking that jointly covers text, image, and audio is still limited, and existing setups lack the methodological rigor to account for absent-persona scenarios or to study grounding systematically. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across ~750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy (Cal), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. Across our experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher Cal, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
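A minimal sketch of how such a calibration-aware score can be computed, assuming each benchmark item is labeled either answerable (with a gold answer) or absent-persona (where the correct behavior is abstention); the item format, the ABSTAIN convention, and the equal weighting of the two cases are illustrative assumptions, not Omni-Persona's exact definition of Cal.

    # Illustrative calibration-aware accuracy over answerable and absent-persona items.
    def calibrated_accuracy(items):
        """items: dicts with 'answerable' (bool), 'gold' (str or None), 'prediction' (str)."""
        correct = 0
        for it in items:
            if it["answerable"]:
                correct += int(it["prediction"] == it["gold"])      # reward correct grounding
            else:
                correct += int(it["prediction"] == "ABSTAIN")       # reward appropriate abstention
        return correct / len(items)

    # Perfect answerable recall can still coexist with a low calibrated score:
    items = [
        {"answerable": True,  "gold": "jazz", "prediction": "jazz"},
        {"answerable": False, "gold": None,   "prediction": "rock"},     # absent-persona hallucination
        {"answerable": False, "gold": None,   "prediction": "ABSTAIN"},
    ]
    print(calibrated_accuracy(items))   # ~0.67: answerable recall is perfect, but the hallucination lowers Cal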
Primary: Seoul National University
All Institutions: Seoul National University, University of Seoul
The paper presents the Omni-Persona benchmark, a pioneering framework for evaluating omnimodal personalization that significantly enhances the understanding of model performance across audio, text, and visual modalities. The comprehensive methodology and experimental rigor contribute valuable insights into the challenges and potential solutions in the field of multimodal AI systems.
The paper introduces the Omni-Persona benchmark, a novel framework for evaluating omnimodal personalization that incorporates audio, text, and visual modalities. The methodology is rigorous, employing the Persona Modality Graph (PMG) to formalize user profiles and cross-modal routing tasks. The introduction of Calibrated Accuracy (Cal) as a metric is particularly noteworthy, as it addresses the limitations of traditional recall metrics by incorporating both correct grounding and appropriate abstention. The systematic treatment of absent-persona scenarios is a significant advancement in the field, allowing for a more realistic evaluation of model performance in real-world conditions.
The experiments are well-structured, comparing various models under different training regimes (SFT and RLVR). The findings reveal critical insights into the performance of open-source models, particularly highlighting the audio-visual grounding gap and the limitations of scaling SFT datasets. The use of diverse models and the detailed analysis of their performance across multiple tasks provide a comprehensive understanding of the challenges in omnimodal personalization. The results are robust, demonstrating the effectiveness of the proposed benchmark in revealing model weaknesses that traditional metrics may overlook.
The paper provides sufficient details regarding the experimental setup, including model architectures, training regimes, and evaluation metrics. However, the reliance on synthetic data and the use of LLM-as-a-judge for evaluation may introduce biases that could affect reproducibility. Future work should aim to validate these findings with human-annotated datasets to enhance generalizability.
The benchmark relies on synthetic audio and text data, which may not fully capture the complexities of real-world scenarios. Additionally, the use of LLM-as-a-judge could introduce biases in evaluation, and the paper acknowledges the need for further human verification. The trade-off observed in RLVR, where models may become overly conservative in their abstention behavior, also highlights a potential area for improvement in reward design.
The Omni-Persona benchmark has significant implications for the development of personalized AI systems, particularly in applications requiring nuanced understanding across multiple modalities. By addressing the challenges of absent-persona scenarios and grounding accuracy, this work could lead to more reliable and effective personal assistants that better serve user needs in real-world contexts.
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
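A sketch of how per-clip scores could be aggregated across the five evaluation dimensions; the binary rubric outcomes and equal weighting are assumptions for illustration, not the exact scoring used by AV-Phys Bench or AV-Phys Agent.

    # Aggregate one generated clip's judgments across the five AV-Phys Bench dimensions.
    DIMENSIONS = [
        "visual_semantic_adherence",
        "audio_semantic_adherence",
        "visual_physical_commonsense",
        "audio_physical_commonsense",
        "cross_modal_physical_commonsense",
    ]

    def clip_score(judgments):
        """judgments: dict mapping each dimension to a 0/1 rubric outcome."""
        missing = [d for d in DIMENSIONS if d not in judgments]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        return sum(judgments[d] for d in DIMENSIONS) / len(DIMENSIONS)

    # A clip that looks and sounds right but breaks cross-modal physics scores 0.8.
    print(clip_score({d: 1 for d in DIMENSIONS[:4]} | {DIMENSIONS[4]: 0}))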
Primary: University of Texas at Dallas
All Institutions: University of Texas at Dallas, University of Washington, University of California, Los Angeles
The main contribution of this paper is the establishment of a comprehensive benchmark for evaluating physical commonsense in joint audio-video generation models, which addresses a critical gap in the current evaluation landscape. The innovative methodology and thorough experimental evaluation provide valuable insights into the limitations of existing models and pave the way for future advancements in the field.
The paper introduces AV-Phys Bench, a novel benchmark for evaluating physical commonsense in joint audio-video generation models. The methodology is robust, employing a structured evaluation rubric that assesses five dimensions of performance across three scene categories. The use of both human evaluation and an automated evaluator (AV-Phys Agent) that integrates multimodal reasoning with deterministic audio measurement tools is particularly innovative. This dual approach enhances the reliability of the assessments and provides a comprehensive framework for understanding model performance beyond mere perceptual quality.
The experiments conducted are thorough, evaluating seven models across various categories and dimensions. The results reveal significant gaps in physical commonsense understanding among current models, particularly in transition scenarios. The performance metrics are well-defined, and the analysis is detailed, providing insights into the strengths and weaknesses of the evaluated models. The findings are significant, highlighting the challenges in achieving physical consistency in audio-video generation.
The paper provides sufficient details regarding the evaluation setup, including the datasets and scoring mechanisms. The availability of the dataset and the code repository enhances reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other models not included in the study.
The paper acknowledges limitations such as the focus on English prompts and the binary nature of the evaluation rubric, which may not capture the nuances of model performance. Additionally, the study is constrained to eight-second clips, which may not represent longer or more complex scenarios effectively.
The introduction of AV-Phys Bench has the potential to significantly influence the development of joint audio-video generation models by providing a clear framework for assessing physical commonsense. This could lead to improvements in model architectures and training methodologies, ultimately enhancing the applicability of these models in real-world scenarios, such as virtual environments and educational content.
The development of OpenClaw has inspired growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. Its unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.
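A minimal sketch of the hybrid grounding idea described above: resolve a target element from the structural XML metadata first, and fall back to visual localization when the structural lookup fails. The attribute names follow common Android UI dumps, and the visual_model.locate call is a hypothetical placeholder, not X-OmniClaw's actual API.

    import xml.etree.ElementTree as ET

    def ground_target(ui_xml, screenshot, target_text, visual_model=None):
        """Return an (x, y) tap point for target_text, or None if ungroundable."""
        root = ET.fromstring(ui_xml)
        for node in root.iter():
            label = node.get("text") or node.get("content-desc") or ""
            bounds = node.get("bounds")            # e.g. "[0,80][1080,200]"
            if target_text.lower() in label.lower() and bounds:
                (x1, y1), (x2, y2) = [
                    tuple(map(int, b.split(","))) for b in bounds[1:-1].split("][")
                ]
                return ((x1 + x2) // 2, (y1 + y2) // 2)   # tap the element's center
        if visual_model is not None:               # structural lookup failed
            return visual_model.locate(screenshot, target_text)   # hypothetical visual grounding
        return None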
Primary: OPPO AI Center
All Institutions: OPPO AI Center
The main contribution of this paper is the introduction of X-OmniClaw, a unified mobile agent architecture that enhances multimodal understanding and interaction on Android devices. This work represents a significant step towards creating more intelligent and context-aware personal assistants, although it would benefit from more rigorous experimental validation and reproducibility measures.
The methodology of X-OmniClaw is robust, integrating multimodal perception, memory, and action into a cohesive framework designed for mobile agents. The use of a unified ingress pipeline for multimodal data and the incorporation of behavior cloning and trajectory replay for skill execution are significant advancements. The architecture is well thought out, allowing for real-time contextual awareness and personalized interactions, which is crucial for mobile applications. However, while the proposed methods are innovative, they build upon existing frameworks like OpenClaw and Hermes, which may limit the perceived novelty.
The paper provides a thorough overview of the system's capabilities through various demo scenarios, showcasing practical applications of the proposed architecture. However, it lacks quantitative evaluation metrics or comparative analysis against existing systems, which would strengthen the claims of enhanced interaction efficiency and task reliability. The absence of rigorous experimental validation is a notable gap.
The paper does not provide specific implementation details or code availability, which raises concerns about reproducibility. While it mentions the intention to release code and materials as open source, the current lack of access limits the ability for others to replicate the findings.
The paper does not address potential limitations of the system, such as the challenges of real-time processing on mobile devices, privacy concerns regarding data collection, and the dependency on the accuracy of the multimodal inputs. Additionally, the reliance on local device capabilities may restrict the system's performance in resource-constrained environments.
The development of X-OmniClaw has the potential to significantly enhance user interaction with mobile devices, making them more intuitive and responsive to user needs. Its applications could extend beyond personal assistants to areas such as accessibility technology, education, and smart home integration, thereby impacting a wide range of fields.
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
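A minimal sketch of the first stage, mapping a frame-aligned expressive drum grid (per-frame onset, velocity, and microtiming features) to codec-token logits with a Transformer encoder; the feature dimensions, single-codebook setup, and hyperparameters are illustrative assumptions, and the predicted tokens would then be passed to the pre-trained codec decoder.

    import torch
    import torch.nn as nn

    class GridToCodecTokens(nn.Module):
        def __init__(self, grid_dim=27, d_model=256, n_layers=4, vocab_size=1024):
            super().__init__()
            self.proj = nn.Linear(grid_dim, d_model)        # drum-grid frame -> model dim
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, vocab_size)      # logits over codec tokens

        def forward(self, grid):                            # grid: (batch, frames, grid_dim)
            return self.head(self.encoder(self.proj(grid)))

    model = GridToCodecTokens()
    grid = torch.randn(2, 400, 27)             # e.g. 9 drum classes x (onset, velocity, microtiming)
    token_ids = model(grid).argmax(-1)         # (2, 400); fed to the frozen codec decoder for audio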
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, Athena RC
The paper presents a novel approach to drum synthesis that effectively combines symbolic MIDI representations with neural audio codecs. This work contributes to the field by providing a structured methodology for generating high-quality drum audio, with implications for both music technology and machine learning research.
The paper introduces a novel approach to drum synthesis by leveraging a Transformer-based model to predict neural audio codec tokens from expressive drum grids. This method is innovative as it employs a two-stage process where the first stage involves mapping MIDI-derived grids to token sequences, and the second stage decodes these tokens into audio waveforms. The use of state-of-the-art neural codecs (EnCodec, DAC, and X-Codec) for this task is particularly noteworthy, as it allows for a controlled comparison of different audio representations and their impact on synthesis quality. The methodology is well-structured, with clear definitions of the expressive drum grid representation and the training process.
The experiments are comprehensive, utilizing the Expanded Groove MIDI Dataset (E-GMD), which is a substantial dataset with aligned MIDI and audio data. The evaluation metrics are robust, including both token-level and audio-level assessments, which provide a multifaceted view of the model's performance. The results demonstrate that EnCodec consistently outperforms the other codecs in terms of audio quality and token accuracy, indicating the effectiveness of the proposed approach. However, the paper could benefit from additional qualitative evaluations, such as user studies, to complement the objective metrics.
The paper provides detailed information about the methodology, including the architecture of the Transformer model, training procedures, and evaluation metrics. However, the absence of a direct implementation or demo could hinder reproducibility for researchers who wish to replicate the results. The project URL does provide access to the code, which is a positive aspect for reproducibility.
One limitation is the lack of subjective evaluations, such as listening tests, which are crucial for assessing the perceptual quality of the generated audio. Additionally, the paper notes that increasing model capacity can lead to instability, which is a significant concern for future work. The reliance on objective metrics alone may not fully capture the nuances of audio quality and musicality.
The proposed system has the potential to significantly impact music production by automating the generation of realistic drum audio from MIDI inputs, thereby facilitating creative processes for musicians and producers. The insights gained from codec comparisons could also inform future developments in audio synthesis and machine learning applications in music.
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval grows ever more important. In realistic search scenarios, this requires matching short user queries to long-form content using both visual and auditory evidence. Yet existing retrieval benchmarks are still dominated by short clips, single modalities, and caption-based evaluation. We introduce FLARE, a full-modality long-video audiovisual retrieval benchmark with user-simulated queries. Built from 399 carefully screened Video-MME videos (10–60 min, 225.4 h) to ensure source quality and diversity, FLARE contains 87,697 clips annotated with vision, audio, and unified audiovisual captions, together with 274,933 user-style queries. Cross-modal queries are further filtered by a hard bimodal constraint, requiring retrieval to fail under either modality alone but succeed when both are combined. FLARE evaluates models under two regimes, caption-based and query-based retrieval, across vision, audio, and unified audiovisual settings. Experiments with 15 representative retrievers show that user-style queries substantially change model behavior, strong caption-based performance does not always transfer to query-based retrieval, and audio–language alignment remains a key bottleneck for unified audiovisual retrieval. Our code and data are released at https://flarebench.github.io/
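A sketch of the hard bimodal constraint used to filter cross-modal queries: keep a query only when neither single-modality retriever places the gold clip within the cutoff but the combined audiovisual scores do. The rank cutoff and the additive score fusion are assumptions for illustration.

    def rank_of_gold(scores, gold_idx):
        """scores: per-candidate similarity scores (higher is better)."""
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        return order.index(gold_idx) + 1

    def passes_bimodal_constraint(vis_scores, aud_scores, gold_idx, k=1):
        fused = [v + a for v, a in zip(vis_scores, aud_scores)]   # simple score-level fusion
        fail_vision = rank_of_gold(vis_scores, gold_idx) > k      # vision alone fails
        fail_audio = rank_of_gold(aud_scores, gold_idx) > k       # audio alone fails
        succeed_fused = rank_of_gold(fused, gold_idx) <= k        # both together succeed
        return fail_vision and fail_audio and succeed_fused

    print(passes_bimodal_constraint([0.2, 0.9, 0.6], [0.9, 0.1, 0.7], gold_idx=2))  # True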
Primary: University of Science and Technology Beijing
All Institutions: University of Science and Technology Beijing, Peking University, Institute of Automation, Chinese Academy of Sciences, Zhongguancun Academy
The paper presents FLARE, a pioneering benchmark for long-video audiovisual retrieval that integrates user-simulated queries and a rigorous bimodal constraint. This work is significant as it addresses critical gaps in existing benchmarks and offers valuable insights into the performance of multimodal retrieval systems, ultimately pushing the boundaries of research in this area.
The paper introduces FLARE, a comprehensive benchmark for long-video audiovisual retrieval that incorporates user-simulated queries and a hard bimodal constraint. The methodology is robust, combining automated processes with human review to ensure high-quality data generation and annotation. The segmentation of videos into coherent clips based on both visual and auditory cues is particularly noteworthy, as is the dual-regime evaluation protocol that isolates the impact of query formulation on model performance.
The experiments are well-structured, evaluating 15 representative models under both caption-based and query-based regimes. The findings reveal significant differences in model performance based on the type of query used, highlighting the limitations of existing models in handling user-style queries. The results are comprehensive, covering various modalities and retrieval directions, and they effectively demonstrate the benchmark's utility in exposing the strengths and weaknesses of current retrieval systems.
The paper provides sufficient details on the benchmark construction, evaluation protocols, and model configurations to allow for reproducibility. The authors have also released their code and data, which enhances the reproducibility of their findings.
The paper acknowledges several limitations, including the potential biases introduced by the automated annotation process and the reliance on a specific set of video sources. The user-simulated queries, while innovative, may not fully capture the diversity of real-world user queries. Additionally, the benchmark may not cover all relevant domains or languages.
The FLARE benchmark has the potential to significantly advance the field of audiovisual retrieval by providing a more realistic evaluation framework that aligns with user behavior. However, the authors also caution against potential misuse of advanced retrieval systems for invasive surveillance or biased content discovery.
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, preserving non-target stems while enabling precise semantic synthesis. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. On this benchmark, Polyphonia achieves a 15.5% increase in target alignment over baselines, while maintaining competitive music fidelity and non-target integrity.
Primary: South China University of Technology
All Institutions: South China University of Technology
The main contribution of this work is the introduction of Polyphonia, a framework that advances controllable music generation by enabling precise intra-stem editing in polyphonic music. This paper significantly enhances the state-of-the-art in audio processing by providing a novel methodology that effectively tackles the challenges of timbre transfer while maintaining the integrity of complex audio mixtures.
The methodology presented in Polyphonia is innovative, addressing the critical challenge of stem-specific timbre transfer in polyphonic music through a novel framework that combines Acoustic-Informed Attention Calibration with a probabilistic acoustic prior. This dual-path mechanism effectively resolves semantic-acoustic misalignment, allowing for precise editing while maintaining the integrity of non-target stems. The use of Ideal Ratio Mask (IRM) as an acoustic prior is particularly noteworthy, as it enhances the model's ability to discern and manipulate specific audio components within complex mixtures. The paper also introduces a comprehensive evaluation framework, PolyEvalPrompts, which is essential for assessing the performance of the proposed method across a wide range of timbre transfer tasks.
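A minimal sketch of the general idea of calibrating a diffuse cross-attention map with an IRM-style acoustic prior over the spectrogram; the blending rule, renormalization, and the way the prior is obtained are illustrative assumptions, not the paper's exact Acoustic-Informed Attention Calibration.

    import torch

    def ideal_ratio_mask(target_mag, residual_mag, eps=1e-8):
        """Soft (freq, time) mask in [0, 1] from separated target and residual magnitudes."""
        return target_mag / (target_mag + residual_mag + eps)

    def calibrate_attention(attn, irm, alpha=0.5):
        """Blend a semantic cross-attention map with the acoustic prior, then renormalize."""
        calibrated = (1 - alpha) * attn + alpha * irm
        return calibrated / (calibrated.sum() + 1e-8)

    attn = torch.rand(128, 216)                            # (freq, time) attention for one stem token
    irm = ideal_ratio_mask(torch.rand(128, 216), torch.rand(128, 216))
    mask = calibrate_attention(attn, irm)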
The experiments conducted are robust, utilizing well-established datasets (MUSDB18-HQ and MusicDelta) and a diverse set of state-of-the-art baselines for comparison. The results demonstrate significant improvements in target alignment and non-target integrity, with quantitative metrics such as CLAP, LPAPS, and CQT1-PCC providing a thorough assessment of the model's performance. The inclusion of subjective evaluations through Mean Opinion Scores (MOS) adds depth to the experimental analysis, confirming the effectiveness of the proposed method in real-world applications.
The paper provides detailed implementation information, including the architecture used (AudioLDM 2) and the specifics of the acoustic prior extraction process. However, the absence of a public code repository limits full reproducibility. While the methodology is well-documented, access to the actual implementation would enhance the ability of other researchers to replicate the results.
One limitation noted is the reliance on pre-trained Blind Source Separation (BSS) models, which may introduce biases based on the training data used, potentially affecting the generalizability of the approach to non-Western music or less common instruments. Additionally, the method's performance may degrade in extremely dense mixtures, although it shows graceful degradation rather than catastrophic failure.
The potential applications of Polyphonia are significant, particularly in the realm of music production and editing, where precise control over individual audio stems is crucial. However, ethical considerations regarding intellectual property and the potential for misuse in unauthorized remixes must be addressed. The authors advocate for the development of provenance tracking and audio watermarking technologies to safeguard artistic integrity.
Reconstructing a 3D sound field from sparse microphone measurements is a fundamental yet ill-posed problem, which we address through Acoustic Transfer Function (ATF) magnitude estimation. ATF magnitude encapsulates key perceptual and acoustic properties of a physical space with applications in room characterization and correction. Although recent generative paradigms such as Flow Matching (FM) have achieved state-of-the-art performance in speech and music generation, their potential in spatial audio remains underexplored. We propose SF-Flow, a novel framework that casts 3D ATF magnitude reconstruction as a guided generation task, with a 3D U-Net conditioned by a permutation-invariant set encoder. This architecture enables reconstruction from an arbitrary number of sparse inputs while leveraging the stable and efficient training properties of FM. Experimental results demonstrate that SF-Flow achieves accurate reconstruction up to 1 kHz, trains substantially faster than the autoencoder baseline, and improves significantly with dataset size.
Primary: National Institute of Informatics
All Institutions: King's College London, National Institute of Informatics, National Institute of Advanced Industrial Science and Technology
The main contribution of this paper is the introduction of SF-Flow, a novel framework for 3D Acoustic Transfer Function magnitude estimation that leverages Flow Matching for efficient and accurate reconstruction from sparse microphone measurements. This work significantly advances the field of spatial audio by providing a new methodology that outperforms traditional approaches while maintaining computational efficiency.
The proposed SF-Flow method introduces a novel approach to sound field magnitude estimation by framing it as a guided generative task using Flow Matching (FM). The architecture employs a 3D U-Net conditioned by a permutation-invariant set encoder, which allows for reconstruction from an arbitrary number of sparse inputs. This methodology is innovative as it leverages the advantages of FM, such as simulation-free training and stable dynamics, to tackle the challenges of sparse measurements in spatial audio. The use of a permutation-invariant encoder is particularly noteworthy, as it addresses the unordered nature of the input data effectively.
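A DeepSets-style sketch of a permutation-invariant set encoder over a variable number of sparse microphone observations (e.g., 3-D position plus measured magnitude features); layer sizes and the feature layout are illustrative, not SF-Flow's exact configuration.

    import torch
    import torch.nn as nn

    class SetEncoder(nn.Module):
        def __init__(self, in_dim=3 + 64, hidden=128, out_dim=256):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
            self.rho = nn.Sequential(nn.Linear(hidden, out_dim), nn.ReLU())

        def forward(self, obs):                  # obs: (batch, n_mics, in_dim), any n_mics
            pooled = self.phi(obs).mean(dim=1)   # mean pooling gives permutation invariance
            return self.rho(pooled)              # conditioning vector for the 3D U-Net

    enc = SetEncoder()
    print(enc(torch.randn(4, 7, 67)).shape)      # same interface for 7, 12, or any mic count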
The experimental setup is robust, utilizing simulated Room Impulse Responses (RIRs) to evaluate the performance of SF-Flow against established baselines such as autoencoders and kernel ridge regression. The results demonstrate that SF-Flow achieves lower Log-Spectral Distortion (LSD) than the autoencoder up to 468 Hz, while maintaining faster training times. The experiments also explore the impact of dataset size on performance, showing that larger datasets significantly improve model accuracy. The comprehensive evaluation across different frequencies and observation counts provides strong evidence of the method's effectiveness.
The paper provides clear details regarding the training procedure, dataset generation, and evaluation metrics, which enhances reproducibility. However, the lack of a direct link to the experimental code or a demo could hinder full reproducibility for other researchers. The authors do mention that the source code and dataset are available, which is a positive aspect.
One limitation of the study is that the results are primarily based on simulated data, which may not fully capture the complexities of real-world environments. Additionally, while the method shows strong performance in magnitude estimation, the authors acknowledge the need for future work to jointly model magnitude and phase, which is crucial for applications like immersive audio rendering.
The implications of this research are significant for various applications in spatial audio, including room acoustics analysis, immersive audio rendering in AR/VR, and audio correction systems. By improving the accuracy and efficiency of sound field reconstruction from sparse measurements, this work could enhance the quality of audio experiences in both consumer and professional settings.
Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) prediction and Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, XLN Audio
The paper presents a novel cold diffusion framework for percussive dereverberation, addressing a critical gap in audio enhancement research. Its innovative methodology, rigorous experimental evaluation, and practical implications for the music industry underscore its significance in advancing the field of audio processing.
The proposed cold diffusion framework is a novel approach to dereverberation specifically tailored for percussive audio, which has been largely overlooked in previous research focused on speech. The methodology effectively models reverberation as a deterministic degradation process and introduces two reverse-process parameterizations that leverage both UNet and diffusion Transformer architectures. The use of a structured forward process and the careful design of training objectives tailored to the unique characteristics of percussive signals demonstrate a thoughtful and innovative approach to the problem.
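A generic cold-diffusion sketch of the sampling loop for this setting, where the forward operator linearly blends the anechoic estimate toward the reverberant input and the reverse loop follows the improved cold-diffusion update x <- x - D(x0_hat, t) + D(x0_hat, t-1); the linear degradation, the clean-signal parameterization, and the placeholder network are assumptions, and the paper's Direct and Delta-normalized parameterizations may differ in detail.

    import torch

    def degrade(x0, x_reverb, t, T):
        """Forward process: blend the anechoic signal toward the reverberant recording."""
        alpha = t / T
        return (1 - alpha) * x0 + alpha * x_reverb

    @torch.no_grad()
    def cold_diffusion_sample(restorer, x_reverb, T=50):
        x = x_reverb.clone()
        for t in range(T, 0, -1):
            x0_hat = restorer(x, t)              # network's current clean-signal estimate
            x = x - degrade(x0_hat, x_reverb, t, T) + degrade(x0_hat, x_reverb, t - 1, T)
        return x

    restorer = lambda x, t: 0.9 * x              # stand-in for the UNet / diffusion Transformer
    dry_estimate = cold_diffusion_sample(restorer, torch.randn(1, 2, 44100))   # stereo drum stem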
The experiments are comprehensive, utilizing a well-curated dataset that combines both acoustic and electronic drum recordings. The authors provide extensive evaluations on both in-domain and out-of-domain test sets, showcasing the robustness of their method against strong baselines. The results indicate significant improvements across various signal-based and perceptual metrics, particularly in transient preservation and reduction of late reverberation, highlighting the effectiveness of the proposed approach.
The paper includes sufficient implementation details, including the architecture of the models, training configurations, and the datasets used. The availability of code and audio examples on GitHub further enhances reproducibility, allowing other researchers to validate and build upon this work.
While the results are promising, the paper acknowledges that the proposed models still face challenges under strong domain shifts and may not generalize well to all types of reverberation, particularly production-style effects. Additionally, the reliance on a curated dataset may limit the generalizability of the findings.
This research has significant implications for music production and audio engineering, providing tools that can enhance the quality of percussive recordings. The focus on percussive dereverberation opens up new avenues for research and application in audio enhancement, potentially benefiting musicians, producers, and audio engineers.
Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.
Primary: Wroclaw University of Science and Technology
All Institutions: Wroclaw University of Science and Technology, Resemble AI, IDEAS Research Institute, Jagiellonian University
The main contribution of this paper is the introduction of APEX, a post-hoc prototype-based interpretability framework for audio classifiers that preserves model performance while providing intuitive, example-based explanations. This work significantly advances the field of explainable AI in audio processing, addressing critical gaps in existing methodologies and offering a robust framework for future research in audio interpretability.
The proposed APEX framework introduces a novel approach to audio classification interpretability by leveraging prototype-based reasoning without requiring fine-tuning of the original model. The methodology is well-structured, incorporating a Disentanglement Module that effectively separates latent features into interpretable components. The four distinct prototype extraction schemes (Square-based, Time-based, Frequency-based, and Time-Frequency-based) are innovative and tailored to the unique characteristics of audio data, addressing the limitations of existing methods that often apply visual techniques to audio spectrograms. The requirement for output invariance while optimizing the feature space is a significant contribution, ensuring that the interpretability does not compromise the model's performance.
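A sketch of one way the four perspectives could be realized on a latent feature map z of shape (channels, frequency, time): pooling over the complementary axis before comparing against a prototype vector. This pooling interpretation is an assumption for illustration, not the released APEX implementation.

    import torch

    def prototype_similarity(z, prototype, mode="time"):
        """Cosine similarity between a prototype (C,) and pooled features of z (C, F, T)."""
        if mode == "square":          # transient events: per (freq, time) cell, no pooling
            feats = z.flatten(1).T                    # (F*T, C)
        elif mode == "time":          # temporal patterns: pool over frequency
            feats = z.mean(dim=1).T                   # (T, C)
        elif mode == "frequency":     # spectral bands: pool over time
            feats = z.mean(dim=2).T                   # (F, C)
        else:                         # "time-frequency": global pooling
            feats = z.mean(dim=(1, 2)).unsqueeze(0)   # (1, C)
        return torch.cosine_similarity(feats, prototype.unsqueeze(0), dim=-1)

    z, proto = torch.randn(64, 128, 250), torch.randn(64)
    print(prototype_similarity(z, proto, mode="frequency").shape)   # torch.Size([128])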
The experiments conducted on the WaveFake dataset for audio deepfake detection and the BirdSet dataset for bioacoustic classification demonstrate the effectiveness of the APEX framework. The results show that APEX maintains the classification performance of the underlying model while providing meaningful and localized explanations. The ablation studies, particularly the targeted masking experiments, validate the importance of the highlighted regions, confirming that the method accurately captures the critical acoustic features used by the classifier.
The paper provides sufficient details regarding the implementation of the APEX framework, including the architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing code and datasets to facilitate further research and validation.
A notable limitation of the APEX framework is its applicability to architectures that follow a specific structure, which may restrict its use in more diverse model types. Additionally, while the method shows promise in interpretability, the evaluation metrics primarily focus on classification performance, which may not fully capture the nuances of interpretability in audio contexts.
The APEX framework has significant implications for the deployment of audio classifiers in safety-critical applications, where interpretability is crucial for ethical and legal compliance. By providing clear and semantically meaningful explanations, APEX can enhance trust in audio processing systems, particularly in fields such as healthcare and security.
In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential for building audio-visual mapping mechanisms. However, existing tools face several pain points: high technical barriers to simulation, offline computation that cannot support real-time interaction, and uncontrollable mapping rules in general sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; and the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.
Primary: Shenyang Conservatory of Music
All Institutions: Shenyang Conservatory of Music
The main contribution of this paper is the development of ChladniSonify, a real-time visual-acoustic mapping system that effectively bridges the gap between visual patterns and sound generation, leveraging advanced machine learning techniques and classical physics principles. The technical contributions are significant, addressing existing challenges in the field of audio-visual mapping and providing a foundation for future artistic and technological innovations.
The methodology is robust, leveraging classical physics principles (Kirchhoff-Love theory) to create a paired dataset for Chladni patterns and vibration frequencies. The use of a lightweight CNN with a CBAM attention mechanism is innovative and tailored for the specific task of recognizing slender nodal lines. The end-to-end system design, integrating Python and Max/MSP for real-time audio-visual mapping, is well thought out and addresses existing gaps in the field.
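A sketch of the final mapping stage: the recognized Chladni mode class indexes a frequency table and drives sine synthesis. The table values here are placeholders, not the calibrated Kirchhoff-Love / ANSYS frequencies of the paper's 15 modes.

    import numpy as np

    MODE_TO_FREQ_HZ = {0: 345.0, 1: 589.0, 2: 812.0}     # placeholder entries; the paper uses 15 modes

    def synthesize_sine(mode_class, duration_s=1.0, sr=44100):
        freq = MODE_TO_FREQ_HZ[mode_class]                # recognized pattern -> theoretical frequency
        t = np.arange(int(duration_s * sr)) / sr
        return 0.5 * np.sin(2 * np.pi * freq * t)         # mono sine sent to Max/MSP for playback

    audio = synthesize_sine(1)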
The experiments are comprehensive, demonstrating high accuracy (99.33%) and low latency (7.03 ms) for pattern recognition, with a full-link latency under 50 ms. The results validate the proposed mapping system's effectiveness, showing a complete match with theoretical values. However, the experiments primarily rely on synthetic data, which may limit the generalizability of the findings.
The paper provides sufficient detail regarding the dataset construction, model architecture, and experimental setup, which should allow for reproducibility. However, the lack of a publicly available dataset or code repository hinders full reproducibility.
The system is currently limited to specific Chladni patterns (square plates with center excitation) and only includes 15 modes. The reliance on synthetic data and the absence of real-world testing may affect the robustness of the findings. Additionally, the system lacks advanced music creation functionalities, limiting its use for non-technical artists.
This work has significant implications for new media art, enabling artists to create interactive installations that link visual patterns to sound in real-time. It opens avenues for further exploration in audio-visual art and could inspire future research in multimodal systems.
The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a 2× acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.
Primary: Descript
All Institutions: Descript
The main contribution of this paper is the introduction of PoDAR, a framework that enhances the modelability of audio latent spaces through power disentanglement, leading to faster convergence and improved generative performance. This work represents a meaningful advancement in the field of audio generative modeling, addressing critical challenges in latent space representation and providing a pathway for future research in multimodal audio systems.
The methodology presented in PoDAR leverages a randomized power augmentation and a latent consistency objective to disentangle signal power from semantic content within the latent space of audio representations. This approach is innovative as it addresses the challenge of latent modelability, which is often overlooked in favor of reconstruction fidelity. By explicitly separating power and semantic content, the authors provide a structured framework that enhances the efficiency of generative models. The use of partial Classifier-Free Guidance (CFG) to selectively apply guidance only to the power-invariant channels further demonstrates a thoughtful approach to improving model robustness.
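A sketch of the randomized power augmentation with a latent consistency objective: a gain-scaled copy of the input should produce the same content channels, while dedicated power channels are free to absorb the gain. The channel split, gain range, stop-gradient, and loss form are illustrative assumptions, not PoDAR's exact design.

    import torch
    import torch.nn.functional as F

    def power_consistency_loss(encoder, wav, n_power_channels=2):
        gain = torch.empty(wav.shape[0], 1, 1).uniform_(0.25, 4.0)   # randomized power augmentation
        z = encoder(wav)                         # (batch, channels, latent_frames)
        z_aug = encoder(wav * gain)
        content = z[:, n_power_channels:]        # channels meant to be power-invariant
        content_aug = z_aug[:, n_power_channels:]
        return F.mse_loss(content_aug, content.detach())             # content must ignore the gain

    encoder = torch.nn.Conv1d(1, 64, kernel_size=16, stride=8)       # stand-in for the VAE encoder
    loss = power_consistency_loss(encoder, torch.randn(4, 1, 16000))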
The experimental evaluation is comprehensive, utilizing well-established benchmarks such as the LibriSpeech-PC dataset. The authors report significant improvements in convergence speed and speaker similarity metrics, demonstrating the effectiveness of their proposed framework. The results are quantitatively supported by metrics like UTMOS and speaker similarity, which are critical in the audio domain. However, the paper could benefit from additional qualitative assessments or user studies to complement the quantitative findings.
The paper provides detailed descriptions of the experimental setup, including the architecture of the autoencoders, training configurations, and the metrics used for evaluation. However, the absence of publicly available code or a demo limits the reproducibility of the results. Providing a GitHub repository or similar resource would greatly enhance the ability of other researchers to replicate the findings.
The primary limitation noted in the paper is the increased computational overhead associated with the dual encoder passes required for the consistency objective. Additionally, the framework has only been tested within the speech domain, which may limit its generalizability to other audio modalities or applications. The authors also acknowledge that the disentanglement focuses solely on power, leaving other potential nuisance variables unaddressed.
The implications of this research are significant, particularly in enhancing the efficiency and quality of audio generation systems. By improving the modelability of latent representations, PoDAR could facilitate advancements in various applications, including text-to-speech synthesis, audio restoration, and music generation. The increased efficiency also suggests a potential reduction in the carbon footprint associated with training large generative models, which is an important consideration in the current landscape of machine learning research.
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
Primary: Tianjin University
All Institutions: Tianjin University, Nanyang Technological University, Shanghai Jiao Tong University, Kuaishou Technology, TeleAI, China Telecom, National University of Singapore, Tencent
This paper introduces a novel framework for evaluating expressive appropriateness in speech that integrates contextual understanding and advanced modeling techniques. The comprehensive approach, robust experimental validation, and potential for broad applications mark a significant contribution to the field of machine learning and audio processing.
The paper presents a comprehensive methodology for evaluating expressive appropriateness in speech, which is a significant advancement over traditional methods that focus primarily on emotional intensity or isolated utterance characteristics. The introduction of the CEAEval framework, along with the CEAEval-D dataset and CEAEval-M model, showcases a well-structured approach that integrates knowledge distillation, adaptive audio attention bias, and reinforcement learning. The separation of the expressive planner and scoring model is particularly innovative, allowing for better handling of long-range contextual information. The use of a multi-dimensional annotation system for expressive attributes is a strong methodological contribution, ensuring a nuanced evaluation of speech expressiveness.
The experimental evaluation is robust, with a clear comparison against existing benchmarks. The authors provide detailed results demonstrating that CEAEval-M significantly outperforms other models across various context sizes. The use of human-annotated data enhances the credibility of the findings, and the reported metrics (LCC and ACC) effectively convey the model's performance. The ablation studies further strengthen the evaluation by isolating the impact of different components of the proposed framework.
The paper outlines a clear methodology for data collection and model training, which supports reproducibility. However, the lack of publicly available raw audio data limits the ability of other researchers to fully replicate the study. The authors do commit to releasing model checkpoints and parameters, which is a positive step towards enhancing reproducibility.
The study is primarily focused on Mandarin speech, which may limit its applicability to other languages and cultural contexts. The authors acknowledge this limitation and express intentions to expand the framework to additional languages in future work. Additionally, the subjective nature of expressive appropriateness could introduce variability in human annotations, although the high inter-annotator agreement suggests reliability.
The proposed framework has significant implications for various applications, including conversational agents, audiobook generation, and interactive storytelling systems. By providing a systematic way to evaluate expressive speech, this research could enhance user experiences in narrative-driven applications and contribute to the development of more emotionally aware AI systems.
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Independent Researcher
The paper presents a novel approach to improving zero-shot text-to-speech systems through innovative scheduling and correction techniques. The technical contributions are significant, addressing key limitations in existing methodologies and demonstrating strong experimental results.
The paper introduces a kinetic-optimal scheduler and a finite-step moment correction for metric-induced discrete flow matching (MI-DFM) in zero-shot text-to-speech (TTS). The methodology is well-structured, addressing two significant limitations of MI-DFM: the reliance on heuristic schedulers and finite-step path-tracking errors. The kinetic-optimal scheduler is derived from Fisher-Rao geometry, providing a training-free numerical approach that avoids hyperparameter tuning. The moment correction effectively mitigates discretization errors in CTMC sampling. Overall, the methods are innovative and present a clear advancement in the field of TTS.
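One way to read the finite-step jump-probability correction, under the standard single-jump approximation for simulating a CTMC with total jump rate $\lambda_t$ out of the current state (an illustrative interpretation, not necessarily the paper's exact derivation): a first-order solver jumps with probability $\lambda_t \Delta t$, whereas the exact probability of leaving the state within the step is

    P_{\mathrm{jump}} \;=\; 1 - \exp\!\Big(-\int_t^{t+\Delta t} \lambda_s \, ds\Big) \;\approx\; 1 - e^{-\lambda_t \Delta t},

and the destination, conditioned on a jump, is still drawn from the normalized rates $P(y \mid \mathrm{jump}) = \lambda_t(y) / \sum_{y'} \lambda_t(y')$, so adjusting the jump probability leaves the jump destination distribution unchanged.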
The experiments are robust, utilizing a large-scale dataset for evaluation and controlled comparisons against state-of-the-art systems. The results demonstrate that GibbsTTS achieves superior objective naturalness and speaker similarity metrics, with subjective evaluations further validating the model's performance. The paper effectively communicates the experimental setup and results, providing a comprehensive analysis of the proposed methods.
The paper provides sufficient detail on the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. However, the lack of a public code repository or demo limits the ability for others to fully replicate the results.
The primary limitation is the focus on zero-shot TTS, which may restrict the generalizability of the proposed methods to other domains. Additionally, the reliance on a specific codec and the absence of exploration into alternative distance metrics for token embeddings could limit the applicability of the findings.
The advancements in TTS technology have significant implications for applications in accessibility, entertainment, and human-computer interaction. The proposed methods could enhance the naturalness and speaker similarity in synthesized speech, potentially improving user experiences in various applications.
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cues, and (2) ineffectively modeling multimodal conflicts, leading to semantic cancellation. To address these, we propose a novel Cognitive Dual-Pathway Reasoning (CDPR) framework, which constructs a stable semantic foundation via the intuition pathway and mitigates high-level semantic conflicts through the reasoning pathway, cooperatively establishing deep semantic relations. Specifically, we first employ a representation disentanglement strategy to extract modality-invariant and specific features. Subsequently, the intuition pathway aggregates cross-modal consensus using shared features for solid global representations. The reasoning pathway introduces an inconsistency perception mechanism, combining semantic prototype matching with statistical probability calibration to precisely quantify conflict severity, and dynamically adjusting the weights between both pathways. Furthermore, a multi-view loss function is adopted to alleviate modality laziness and learn structured features at different stages. Extensive experiments on two benchmarks show that CDPR achieves SOTA performance and superior robustness in mitigating multimodal inconsistency. The code is available at https://github.com/Hebust-NLP/CDPR.
Primary: Hebei University of Science and Technology
All Institutions: Hebei University of Science and Technology, Hebei University of Economics and Business
The main contribution of this paper is the introduction of the CDPR framework, which effectively addresses the challenges of multimodal inconsistency in intent recognition through a novel dual-pathway reasoning approach. This work significantly advances the state of the art in MIR by providing a robust methodology and demonstrating its effectiveness through rigorous experimentation.
The proposed Cognitive Dual-Pathway Reasoning (CDPR) framework presents a novel approach to Multimodal Intent Recognition (MIR) by introducing a dual-pathway architecture that simulates human cognitive processes. The methodology effectively disentangles modality-invariant and specific features, employs an inconsistency perception mechanism, and utilizes a multi-view loss function to enhance learning. The integration of intuitive and reasoning pathways is innovative, allowing for adaptive regulation based on conflict levels, thereby addressing significant challenges in existing MIR approaches. The detailed explanation of feature extraction, decoupling, and the dual-pathway mechanism demonstrates a robust theoretical foundation.
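As an illustration of the conflict-aware weighting described above, the sketch below scores cross-modal conflict from the modality-shared features and shifts weight from the intuition pathway to the reasoning pathway as conflict grows. The conflict measure, tensor shapes, and fusion rule are simplifying assumptions for exposition and are not taken from the released CDPR code.

```python
import torch
import torch.nn.functional as F

def conflict_score(shared_feats):
    """Toy conflict measure: average pairwise cosine distance between the
    modality-shared features (e.g., text, video, audio).
    shared_feats: tensor of shape (num_modalities, batch, dim)."""
    m = shared_feats.shape[0]
    sims = []
    for i in range(m):
        for j in range(i + 1, m):
            sims.append(F.cosine_similarity(shared_feats[i], shared_feats[j], dim=-1))
    mean_sim = torch.stack(sims).mean(dim=0)      # (batch,)
    return (1.0 - mean_sim) / 2.0                 # in [0, 1]; higher means more conflict

def fuse_pathways(intuition_logits, reasoning_logits, conflict):
    """Weight the reasoning pathway more heavily when conflict is high."""
    w = conflict.unsqueeze(-1)                    # (batch, 1)
    return (1.0 - w) * intuition_logits + w * reasoning_logits
```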
The experiments conducted on two benchmark datasets (MIntRec and MIntRec2.0) are comprehensive, showcasing the effectiveness of the CDPR framework. The reported state-of-the-art (SOTA) performance across various metrics (accuracy, F1-score, etc.) substantiates the claims made in the paper. The ablation studies further validate the contributions of individual components, indicating that each part of the proposed method plays a crucial role in achieving superior performance. The robustness tests against noise also highlight the practical applicability of the model in real-world scenarios.
The paper provides sufficient implementation details, including the architecture, datasets, training protocols, and hyperparameters, which facilitate reproducibility. The availability of the code on GitHub enhances the potential for other researchers to replicate and build upon the work.
While the CDPR framework shows promising results, the paper does not extensively discuss the limitations of the proposed method. Potential weaknesses could include the reliance on specific datasets for training and evaluation, which may not generalize well to other multimodal contexts. Additionally, the complexity of the model may pose challenges in terms of computational efficiency in real-time applications.
The advancements in multimodal intent recognition have significant implications for various applications, including human-computer interaction, autonomous systems, and multimedia retrieval. The ability to effectively handle multimodal inconsistencies can enhance user experience in interactive systems, making them more intuitive and responsive to user intentions.
Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. By modeling dependencies across per-stem content and cross-stem harmonic structure, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics, demonstrating the necessity of dedicated multi-instrument timbre transfer over naive separate-then-transfer pipelines. As a result, this work confirms that cross-stem modeling is essential for mixture-level timbre transfer, as the proposed joint setting consistently exceeds an equivalent single-stem ablation.
Primary: unknown
All Institutions: unknown
This paper presents MixtureTT, a novel approach for per-stem timbre transfer directly from polyphonic mixtures, significantly advancing the field of music audio processing. The technical contributions, particularly the joint diffusion model and its evaluation, mark a meaningful step towards more coherent and efficient audio manipulation techniques.
The proposed methodology, MixtureTT, innovatively addresses the challenge of timbre transfer in polyphonic music by employing a joint stem diffusion transformer that operates directly on polyphonic mixtures without requiring explicit source separation. This approach is a significant departure from traditional separate-then-transfer pipelines, which are prone to artifacts and inconsistencies. The architecture effectively balances per-stem independence with cross-stem coordination, allowing for coherent audio generation. The use of a shared diffusion process and the introduction of disentanglement losses further enhance the model's ability to maintain timbral fidelity while preserving content integrity.
The experimental evaluation is robust, utilizing both objective metrics (e.g., Fréchet Audio Distance, Jaccard Distance, Chroma Cosine Similarity) and subjective assessments through a listening test with human participants. The results consistently demonstrate that MixtureTT outperforms single-instrument baselines, even when those baselines are provided with isolated stems. This is a strong validation of the proposed method's effectiveness. The use of the SATB choral dataset is appropriate, though it may limit generalizability to other musical contexts.
While the paper details the training process and architecture, the lack of specific implementation details, such as hyperparameters and the exact training environment, may hinder reproducibility. The authors mention a demo URL, which could provide additional insights into the model's performance, but the absence of a public code repository is a drawback.
One limitation is the reliance on a specific dataset (CocoChorales), which may not fully represent the diversity of musical styles and genres. Additionally, while the model shows promise, the scalability to larger ensembles or more complex musical structures remains untested. The paper also does not address potential computational costs associated with training and inference, which could be a barrier for broader adoption.
The implications of this work are significant for music production and audio engineering, as it enables more flexible and coherent manipulation of musical recordings. This could streamline workflows for musicians and producers, allowing for innovative creative possibilities in music composition and arrangement. Furthermore, the findings encourage further exploration of mixture-level modeling in generative music tasks, potentially influencing future research directions.
Text-to-image (T2I) generation using multiple conditions enables fine-grained user control over the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizations, leading to unacceptable generation latency. In this paper, we propose an end-edge collaborative system design to accelerate multi-condition T2I generation through adaptive condition offloading and pruning. Extensive offline profiling reveals that different conditions exhibit significant diversity in computation and communication costs. To this end, we propose a \textit{Subtask Manager} that jointly optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, balancing local and edge execution delays to minimize overall preprocessing latency. Then, we design a lightweight feature-driven \textit{Conditioning Scale Estimator} that evaluates the contribution of each condition by analyzing its feature activation strength and overlap with other conditions. This allows adaptive conditioning scale selection and pruning of insignificant conditions, thereby accelerating the denoising process. Extensive experimental results show that our system reduces latency by nearly 25\% and improves average generation quality by 6\%, outperforming other benchmarks.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Central South University
The main contribution of this paper is the introduction of an end-edge collaborative system that effectively accelerates multi-condition T2I generation through adaptive condition offloading and pruning. This work represents a meaningful advancement in the field, addressing critical challenges in computational efficiency and user control in AI-generated content.
The paper presents a novel end-edge collaborative system design that addresses the computational and communication overhead associated with multi-condition text-to-image (T2I) generation. The proposed Subtask Manager optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, which is a significant improvement over existing methods. The Conditioning Scale Estimator further enhances the system by evaluating the contribution of each condition, allowing for adaptive pruning of insignificant conditions. This dual approach effectively reduces latency while maintaining image quality, showcasing a well-thought-out methodology that balances local and edge processing.
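A rough sketch of how such a feature-driven estimator might score and prune conditions is given below: each condition is rated by its activation strength minus a penalty for overlap with the other conditions, and low-scoring conditions are dropped before denoising. The pooling, masking, thresholds, and penalty weight are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def score_conditions(features, overlap_penalty=0.5):
    """Score each condition map by activation strength minus a penalty for
    overlap with the other conditions.
    features: dict name -> 2D activation map (e.g., pooled control features).
    Returns a dict of scalar scores; higher means more useful."""
    names = list(features)
    masks = {n: (np.abs(f) > np.abs(f).mean()).astype(float) for n, f in features.items()}
    scores = {}
    for n in names:
        strength = float(np.abs(features[n]).mean())
        overlaps = [
            float((masks[n] * masks[m]).sum() / (masks[n].sum() + 1e-8))
            for m in names if m != n
        ]
        overlap = float(np.mean(overlaps)) if overlaps else 0.0
        scores[n] = strength - overlap_penalty * overlap
    return scores

def prune_conditions(features, keep_ratio=0.5):
    """Keep the top-scoring conditions and drop the rest before denoising."""
    scores = score_conditions(features)
    keep = sorted(scores, key=scores.get, reverse=True)[: max(1, int(len(scores) * keep_ratio))]
    return {n: features[n] for n in keep}
```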
The experimental results are robust, demonstrating a 25% reduction in latency and a 6% improvement in average generation quality compared to existing benchmarks. The authors conduct extensive profiling and performance evaluations across various hardware setups, which strengthens the validity of their claims. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize the improvements more effectively.
The paper provides a clear description of the experimental setup, including the hardware used and the specific configurations for the algorithms. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should consider making the implementation available to facilitate further research and validation.
One limitation of the proposed system is its reliance on specific hardware configurations, which may not generalize to all user devices. Additionally, the heuristic nature of the optimization may not guarantee the absolute best performance in all scenarios, particularly in highly variable network conditions. The paper also does not address potential scalability issues when the number of users or conditions increases significantly.
The proposed system has significant implications for real-time applications in AI-generated content, particularly in scenarios where user interaction and control are paramount. By reducing latency and improving generation quality, this work could enhance user experiences in creative industries, gaming, and virtual reality. The approach also opens avenues for further research in edge computing and collaborative AI systems.
Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce \textbf{Omni-DeepSearch}, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44\% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.
Primary: Chinese Academy of Sciences (CASIA)
All Institutions: Chinese Academy of Sciences (CASIA), University of Chinese Academy of Sciences (UCAS), Beijing Academy of Artificial Intelligence (BAAI), Peking University, Tsinghua University
The paper presents Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search, highlighting the challenges and limitations of current models while providing a structured methodology for future research in this underexplored area.
The paper introduces a novel benchmark, Omni-DeepSearch, which focuses on audio-driven omni-modal deep search, a largely unexplored area in multimodal learning. The methodology is well-structured, with a clear definition of the task and a multi-stage filtering pipeline that ensures the quality and relevance of the dataset. The authors emphasize audio dependence and multi-hop reasoning, which are critical for evaluating models that must infer and retrieve information across different modalities based solely on audio input. The task taxonomy and dataset construction are thorough, providing a solid foundation for future research.
The experiments conducted on various models, including both closed-source and open-source, reveal significant challenges in the task, with the best-performing model achieving only 43.44% accuracy. This highlights the complexity of audio-driven retrieval and reasoning, as well as the limitations of current models. The ablation studies and case analyses provide valuable insights into specific failure modes, such as dominant clue bias and misclassification, which are critical for understanding the limitations of existing approaches.
While the paper provides a comprehensive description of the dataset construction and evaluation metrics, it lacks detailed implementation specifics that would facilitate reproducibility. The absence of a publicly available dataset or code repository further limits the ability of other researchers to replicate the results or build upon this work.
The paper acknowledges several limitations, including the inherent ambiguity of audio signals and the reliance on external knowledge for retrieval. Additionally, the performance gap between closed-source and open-source models suggests that there is still much work to be done in improving model capabilities in this domain. The lack of a publicly available dataset or code also hinders broader adoption and experimentation.
The introduction of Omni-DeepSearch has the potential to significantly impact the field of multimodal learning by providing a new benchmark that emphasizes audio as a primary modality for information retrieval. This could lead to advancements in various applications, including voice-activated assistants, audio-based search engines, and enhanced human-computer interaction systems. By addressing the challenges of audio-driven reasoning, this work opens up new avenues for research and development in multimodal AI.
Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/. The complete source code will be released after the manuscript is accepted.
Primary: Nanjing University
All Institutions: Nanjing University, MiLM Plus, Xiaomi Inc.
The paper presents L3-SE, a novel framework for reducing linguistic hallucination in LM-based speech enhancement through noise-invariant acoustic-semantic distillation, demonstrating significant improvements in linguistic consistency and perceptual quality under challenging conditions.
The proposed L3-SE framework introduces a novel approach to speech enhancement by utilizing a noise-invariant acoustic-semantic distillation strategy. This dual-target distillation method, which leverages both acoustic fidelity and semantic consistency, is innovative in addressing the issue of linguistic hallucination in generative speech models. The architecture effectively combines a shared backbone with task-specific heads, allowing for robust conditioning that enhances the model's performance under noisy conditions. The integration of a high-fidelity codec further supports the quality of the generated speech, making the methodology both comprehensive and well-structured.
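The dual-target distillation can be pictured as a joint objective over two clean-speech targets, as in the sketch below; the specific loss forms (L1 for the acoustic target, cosine for the semantic target), the head modules, and the weighting are assumptions made for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dual_distillation_loss(noisy_feats, acoustic_target, semantic_target,
                           acoustic_head, semantic_head, lam=1.0):
    """Distill a noise-invariant conditioning encoder toward two clean-speech
    targets: an acoustic target (reconstruction fidelity) and a semantic
    target (linguistic consistency). Loss forms and weighting are
    illustrative, not taken from the paper."""
    acoustic_pred = acoustic_head(noisy_feats)
    semantic_pred = semantic_head(noisy_feats)
    l_acoustic = F.l1_loss(acoustic_pred, acoustic_target)
    # cosine-based semantic distillation, averaged over frames
    l_semantic = 1.0 - F.cosine_similarity(semantic_pred, semantic_target, dim=-1).mean()
    return l_acoustic + lam * l_semantic
```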
The experiments are thorough, utilizing a variety of datasets and evaluation metrics that cover perceptual quality, linguistic consistency, and speaker preservation. The results demonstrate that L3-SE outperforms existing baselines, particularly in challenging conditions such as low-SNR and reverberation. The use of both objective and subjective metrics strengthens the evaluation, providing a well-rounded assessment of the framework's capabilities.
The paper mentions that the complete source code will be released upon acceptance, which is a positive aspect for reproducibility. However, the implementation details are somewhat dense, and while they provide a comprehensive overview of the training process, clearer guidelines or supplementary materials could enhance reproducibility further.
One limitation is the reliance on specific datasets for training and evaluation, which may affect generalizability to other speech enhancement scenarios. Additionally, while the framework shows improvements in linguistic consistency, the perceptual quality metrics could still be further optimized to match or exceed the best-performing models in all conditions.
The proposed framework has significant implications for applications in speech recognition, communication technologies, and assistive devices, where clarity and accuracy in speech are crucial. By addressing linguistic hallucination effectively, it could enhance user experience in various real-world applications, making it a valuable contribution to the field of audio processing and machine learning.
Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies bias sources before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing, and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important.
Primary: Wichita State University
All Institutions: Wichita State University, Institut national de la recherche scientifique (INRS-EMT), INRS-UQO Mixed Research Unit on Cybersecurity
This paper presents a pioneering diagnosis-first framework for addressing gender bias in audio deepfake detection systems, significantly advancing the understanding and mitigation of bias in machine learning applications. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field, highlighting the importance of systematic bias diagnosis before applying mitigation strategies.
The paper introduces a systematic diagnosis-first framework for identifying and mitigating gender bias in audio deepfake detection. This approach is innovative as it emphasizes understanding the sources of bias before applying mitigation strategies, which is a significant departure from existing methods that often apply fixes without thorough diagnosis. The methodology is well-structured, detailing a comprehensive evaluation of bias sources at data, model, and decision levels, and it introduces novel mitigation techniques such as EAFR, SGFS, and GNEA. Each method is clearly defined, and the rationale for their implementation is well-articulated.
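The per-gender threshold adjustment reported in the paper can be illustrated by calibrating a separate operating point for each group, for example so that every group runs at the same false-positive rate on bona fide speech. The sketch below shows one such instantiation; the target rate, group labels, and score conventions are assumptions for exposition, not the authors' exact procedure.

```python
import numpy as np

def per_group_thresholds(scores, labels, groups, target_fpr=0.01):
    """Pick a separate decision threshold per demographic group so that each
    group operates at (approximately) the same false-positive rate.

    scores: detector scores (higher = more likely spoof)
    labels: 1 for spoof, 0 for bona fide
    groups: group id per sample (e.g., 'f' / 'm')
    """
    scores, labels, groups = map(np.asarray, (scores, labels, groups))
    thresholds = {}
    for g in np.unique(groups):
        neg = np.sort(scores[(groups == g) & (labels == 0)])
        # threshold above which roughly target_fpr of bona fide samples fall
        idx = int(np.ceil((1.0 - target_fpr) * len(neg))) - 1
        thresholds[g] = neg[np.clip(idx, 0, len(neg) - 1)]
    return thresholds

def predict(scores, groups, thresholds):
    """Apply the group-specific thresholds to new samples."""
    return np.array([s > thresholds[g] for s, g in zip(scores, groups)])
```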
The experimental setup is robust, utilizing the ASVSpoof5 dataset, which is appropriate for the study's focus on gender fairness in audio deepfake detection. The paper conducts extensive experiments across multiple models (AASIST and Wav2Vec2+ResNet18) and evaluates various mitigation strategies, providing a thorough analysis of their effectiveness. The results are presented clearly, with a focus on multiple fairness metrics, which enhances the credibility of the findings. However, the reliance on a single dataset may limit the generalizability of the results.
The paper provides sufficient detail regarding the experimental setup, model architectures, and evaluation protocols, which supports reproducibility. However, the absence of publicly available code or a project repository limits the ability for others to reproduce the findings directly. Including a demo or project URL would enhance the reproducibility aspect significantly.
The study is limited to a single dataset (ASVSpoof5) and focuses on binary gender labels, which may not capture the full spectrum of gender representation. Additionally, while the paper identifies multiple sources of bias, it acknowledges that no single method completely closes the fairness gap, indicating that further research is needed to address these issues comprehensively.
The implications of this work are significant, particularly in high-stakes applications such as security and identity verification, where fairness and bias in detection systems can have profound societal impacts. By addressing gender bias in audio deepfake detection, the paper contributes to the broader discourse on fairness in AI systems, emphasizing the need for equitable treatment across demographic groups.
Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.
Primary: NetEase Cloud Music
All Institutions: NetEase Cloud Music, Individual Researcher
The paper introduces a novel Retrieval-Edit-Rerank framework for chord generation that effectively balances stylistic diversity and music-theoretic feasibility. This work is significant as it provides a structured approach to a complex creative task, advancing the field of music generation by offering a system that is both interpretable and adaptable.
The proposed Retrieval-Edit-Rerank (RER) framework effectively decomposes the chord generation task into three distinct stages, allowing for a clear separation of concerns that enhances both interpretability and control over the generation process. The methodology is well-structured, with a focus on leveraging a melody-chord memory for retrieval, followed by an editing stage that enforces music-theoretic constraints, and a reranking stage that resolves preferences among feasible candidates. This approach is innovative in the context of music generation, as it combines stylistic diversity with theoretical validity in a systematic manner. The use of a contrastive learning framework for memory construction is a notable strength, as it allows for the retrieval of stylistically relevant chord progressions without sacrificing harmonic integrity.
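The three-stage decomposition can be summarised as a short pipeline, sketched below with placeholder callables (embed, is_feasible, edit, preference) standing in for the paper's melody encoder, music-theoretic checks, minimal-edit operator, and soft preference model; the fallback to the top retrieved candidate when no repair succeeds is an added assumption, not part of the described system.

```python
import numpy as np

def retrieval_edit_rerank(melody, memory, embed, is_feasible, edit, preference, top_k=20):
    """Sketch of the Retrieval-Edit-Rerank decomposition.

    memory:      list of (melody_embedding, chord_progression) pairs
    embed:       melody encoder returning a query embedding
    is_feasible: hard music-theoretic check on a progression
    edit:        minimal modification that repairs an infeasible progression
    preference:  soft scoring function used only among feasible candidates
    """
    # 1) Retrieval: stylistically plausible candidates from the memory.
    query = embed(melody)
    ranked = sorted(memory, key=lambda kv: float(np.dot(query, kv[0])), reverse=True)
    candidates = [chords for _, chords in ranked[:top_k]]

    # 2) Editing: enforce feasibility with minimal changes.
    repaired = [c if is_feasible(c) else edit(c) for c in candidates]
    feasible = [c for c in repaired if is_feasible(c)] or candidates[:1]

    # 3) Reranking: resolve soft preferences among feasible candidates.
    return max(feasible, key=preference)
```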
The experiments are comprehensive, utilizing multiple datasets and a variety of metrics for both objective and subjective evaluation. The inclusion of ablation studies strengthens the findings by demonstrating the importance of each stage in the RER framework. The results show that the proposed method outperforms existing end-to-end models in terms of balancing diversity and feasibility, which is a critical aspect of chord generation. The subjective evaluations involving human participants provide additional validation of the system's effectiveness, indicating a well-rounded experimental design.
The paper provides a clear description of the methodology and experimental setup, which facilitates reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the results fully. Including a GitHub repository or links to datasets would significantly enhance the reproducibility of the work.
One limitation is the reliance on a fixed set of music-theoretic constraints, which may not capture the full range of stylistic diversity present in various musical genres. Additionally, the system's performance may vary depending on the quality and diversity of the melody-chord memory constructed during training. The paper also notes that the editing stage can sometimes lead to overly conservative outputs, which may limit creative exploration.
The RER framework has the potential to significantly impact music generation applications, particularly in contexts where adherence to music theory is essential, such as in music education, composition tools, and automated music production systems. By providing a controllable and interpretable approach to chord generation, this work could facilitate more nuanced interactions between musicians and AI systems, enhancing creativity while respecting musical traditions.
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
Primary: Hong Kong University of Science and Technology (HKUST)
All Institutions: Hong Kong University of Science and Technology (HKUST), Stanford University
The main contribution of this paper is the introduction of AnisoAlign, a structured geometric correction framework that addresses the modality gap in multimodal learning, allowing for effective training of multimodal models using unimodal data. This work significantly advances the understanding of modality alignment and provides a robust methodology that can be applied in various multimodal applications.
The proposed methodology, AnisoAlign, presents a novel approach to addressing the modality gap in multimodal learning by focusing on the geometric structure of modality representations. The authors effectively identify that the modality gap is not merely a centroid shift but an anisotropic residual, which is a significant insight. The method involves a two-stage process that includes a target-modality prior pretraining and a bounded refinement step to ensure that the source modality's semantic structure is preserved while aligning with the target modality. This structured approach is well-justified and theoretically supported, making it a strong contribution to the field.
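One way to picture a bounded, anisotropic correction is sketched below: the mean gap between source and target embeddings is projected onto the target modality's top principal directions and applied, clipped, only along those directions. This is an illustrative simplification under stated assumptions and is not the paper's exact AnisoAlign procedure, which includes a target-modality prior-pretraining stage and a bounded refinement step.

```python
import numpy as np

def anisotropic_correction(source, target, k=8, max_shift=1.0):
    """Illustrative sketch (not the paper's procedure): shift source
    embeddings toward the target modality along the target's top-k
    principal directions only, with a bound on the per-direction shift.

    source: (n, d) source-modality embeddings
    target: (m, d) target-modality embeddings
    """
    mu_s, mu_t = source.mean(0), target.mean(0)
    # top-k principal directions of the target modality (its geometric prior)
    _, _, vt = np.linalg.svd(target - mu_t, full_matrices=False)
    dirs = vt[:k]                                   # (k, d), orthonormal rows
    shift = dirs @ (mu_t - mu_s)                    # mean gap projected per direction
    shift = np.clip(shift, -max_shift, max_shift)   # bounded correction
    return source + shift @ dirs                    # applied only along the k directions
```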
The experiments are comprehensive, evaluating both geometric diagnostics and the performance of the model in multimodal large language model (MLLM) training. The results demonstrate that AnisoAlign outperforms existing methods in terms of both representation alignment and MLLM training effectiveness. The use of various metrics to assess performance adds rigor to the evaluation, although the paper could benefit from more extensive ablation studies to further clarify the contributions of individual components.
The paper provides a detailed description of the methodology and experimental setup, which aids in reproducibility. However, the lack of a public code repository or demo limits the ability for external validation of results. Future work should consider making the implementation available to enhance reproducibility.
One limitation is the reliance on high-quality unimodal data, which may not always be available. Additionally, while the paper discusses the geometric aspects of modality alignment, it does not fully explore the implications of varying data quality on the effectiveness of the proposed method.
The findings have significant implications for the development of multimodal models, particularly in scenarios where paired data is scarce. By enabling the use of unimodal data for training, this work could facilitate advancements in applications such as image captioning, visual question answering, and other areas where multimodal understanding is crucial.
Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p ≪ 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
Primary: Heriot-Watt University Dubai
All Institutions: Heriot-Watt University Dubai
The paper presents a significant advancement in unsupervised learning for bioacoustic state discovery, demonstrating the ability to extract structured acoustic patterns from honey bee buzzing without prior assumptions or annotations. The methodology is innovative and the results impactful, contributing to both machine learning and ecological monitoring fields.
The paper introduces BeeVe, a novel unsupervised framework for acoustic state discovery in honey bee buzzing, leveraging a self-supervised Patchout Spectrogram Transformer (PaSST) as a feature extractor and a Vector-Quantized Variational Autoencoder (VQ-VAE) for learning a discrete codebook of acoustic tokens. The methodology is well-structured, employing a rigorous unsupervised learning approach without relying on predefined labels or semantic assumptions. The use of post-hoc evaluation against known queen status to validate the learned tokens adds robustness to the methodology. However, the choice of PaSST as a frozen feature extractor, while justified, may limit the model's adaptability to other non-vocal species.
The experiments are comprehensive, utilizing a dataset of honey bee audio to assess the effectiveness of the proposed method. The results demonstrate significant separation between queenright and queenless conditions, with Jensen-Shannon Divergence values indicating meaningful distinctions. The identification of stable sub-states within the queenless condition and the analysis of token transition patterns provide strong evidence of the model's capability to uncover structured acoustic states. The metrics used for evaluation, including Jaccard overlap and manifold projection, are appropriate and effectively illustrate the model's performance.
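For reference, the Jensen-Shannon divergence between two codebook-usage histograms can be computed as below; the example histograms are synthetic, and the log base used here (base 2) is an assumption made for illustration.

```python
import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """JSD (base-2 logs) between two discrete token-usage distributions,
    e.g., codebook histograms from queenright vs. queenless recordings."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: synthetic token-count histograms over a 16-entry codebook.
weights = np.linspace(1, 4, 16)
queenright = np.random.multinomial(1000, np.ones(16) / 16)
queenless = np.random.multinomial(1000, weights / weights.sum())
print(jensen_shannon_divergence(queenright, queenless))
```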
The paper provides detailed implementation details, including the architecture of the VQ-VAE, training objectives, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for other researchers to replicate the findings directly.
The study is limited by its reliance on a controlled subset of the UrBAN dataset, which may not fully capture the diversity of acoustic states across different hives and conditions. Additionally, while the findings are promising, the lack of biological annotation to ground the discovered states raises questions about their true biological relevance. The scalability of the approach to larger datasets and more complex hive conditions remains to be validated.
The implications of this work extend to non-invasive monitoring of honey bee colonies, potentially aiding in the early detection of conditions such as queen loss or swarming. The unsupervised nature of the framework allows for the identification of previously unlabelled states, which could enhance hive management practices and contribute to pollinator conservation efforts. The approach also opens avenues for future research in bioacoustics and machine learning applications in non-vocal species.
The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER), yet such performance estimates depend on the classifier and operating point, providing an incomplete or even misleading characterisation of privacy risk. We investigate the use of similarity rank disclosure (SRD), an information-theoretic metric that operates on feature representations rather than classifier decisions, providing a threshold-independent assessment of privacy and an analysis of both average and worst-case disclosure. We report its application to speaker embeddings, fundamental frequency, and phone embeddings using 2024 VoicePrivacy Challenge systems. The SRD reveals privacy leaks and system-specific weaknesses missed by EER-based evaluation. Findings highlight the merit of representation-level metrics and demonstrate the potential of SRD as a flexible and interpretable tool for the evaluation of voice anonymisation.
Primary: EURECOM
All Institutions: EURECOM, Ruhr-Universität Bochum, Orange Innovation, University of Stuttgart
The main contribution of this paper is the introduction of the Similarity Rank Disclosure (SRD) metric for evaluating voice anonymisation, which provides a more interpretable and comprehensive assessment of privacy risks compared to traditional metrics. The technical contribution is significant as it addresses critical gaps in existing evaluation practices, offering a robust framework for future research and application in voice privacy.
The paper introduces the Similarity Rank Disclosure (SRD) as a novel metric for evaluating voice anonymisation, which operates independently of classifier decisions and provides a more nuanced understanding of privacy risks. The methodology is well-structured, detailing the steps for computing SRD, including ranking, distribution generation, and statistical modeling. The use of empirical probability distributions and beta-binomial fitting enhances the robustness of the evaluation. However, the paper could benefit from clearer explanations of the statistical methods used and their implications for the results.
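The rank computation at the heart of SRD can be illustrated as below: for each probe representation, the true speaker's enrolled representation is ranked among all enrolled speakers by similarity, and the distribution of these ranks is then modelled statistically. The sketch uses cosine similarity and random embeddings purely for illustration and omits the empirical-distribution and beta-binomial fitting stages.

```python
import numpy as np

def similarity_ranks(probe_embeddings, enrolled_embeddings, true_idx):
    """For each probe, rank the true speaker's enrolled embedding among all
    enrolled speakers by cosine similarity (rank 1 = most similar).
    Low ranks for anonymised speech indicate residual speaker leakage."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    p = normalize(np.asarray(probe_embeddings))      # (n_probes, d)
    e = normalize(np.asarray(enrolled_embeddings))   # (n_speakers, d)
    sims = p @ e.T                                   # (n_probes, n_speakers)
    order = np.argsort(-sims, axis=1)                # descending similarity
    ranks = np.array([int(np.where(order[i] == true_idx[i])[0][0]) + 1
                      for i in range(len(p))])
    return ranks                                     # feed these into the SRD statistics

# Example with random embeddings for 5 probes and 20 enrolled speakers.
rng = np.random.default_rng(0)
probes, enrolled = rng.normal(size=(5, 192)), rng.normal(size=(20, 192))
print(similarity_ranks(probes, enrolled, true_idx=[3, 7, 1, 0, 19]))
```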
The experiments leverage a comprehensive dataset from the 2024 VoicePrivacy Challenge, applying the SRD to various anonymisation systems. The results demonstrate that SRD can reveal privacy leaks and weaknesses that traditional metrics like EER miss. The evaluation includes both qualitative and quantitative analyses, providing a thorough comparison of different anonymisation approaches. However, the paper does not provide extensive details on the experimental setup, such as the specific configurations of the anonymisation systems or the exact nature of the datasets used.
The paper lacks sufficient details for full reproducibility. While it describes the methodology and provides some results, it does not include code or data availability, which are critical for other researchers to replicate the findings. Clearer documentation of the experimental setup and access to the datasets used would enhance reproducibility.
One limitation is the reliance on a specific dataset (2024 VoicePrivacy Challenge) which may not generalize to other contexts or datasets. Additionally, the SRD's effectiveness in various real-world scenarios remains to be fully validated. The paper also acknowledges the potential for overestimation of privacy if strong attack models are not used, which is a critical consideration for future work.
The findings have significant implications for the development of privacy-preserving technologies in voice processing, particularly in light of increasing concerns about data privacy and regulation. The SRD could serve as a foundational tool for evaluating voice anonymisation systems, influencing both academic research and industry practices. The flexibility of the SRD to adapt to various feature representations also opens avenues for future research in related domains.
In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccessible in practice. Practical implementations overcome this by estimating the sample covariance matrix (SCM) through averaging over a block of temporal samples. However, in non-stationary settings, a naive batch approach fails. Moving interferers smear the SCM, causing the beamformer to place nulls in outdated locations while failing to track newly active interferers, thereby degrading its nulling capabilities. To address this fundamental limitation, an Online Segmented Beamformer is proposed. This algorithm incorporates data-driven temporal segmentation to causally minimize output power while dynamically adapting the SCM estimation windows to local stationarity. By framing the problem through the lens of dynamic programming, the proposed method tracks abrupt environmental changes and resets covariance estimates in real-time. We validate the performance of this framework in a complex, reverberant simulated acoustic environment and in highly reverberant real world experiments, demonstrating its superiority over fixed-window adaptive methods.
Primary: Stony Brook University
All Institutions: Stony Brook University, University of Illinois Chicago, University of Massachusetts Dartmouth
The main contribution of this paper is the introduction of an Online Segmented Beamformer that dynamically adapts its covariance estimation windows to track changes in non-stationary acoustic environments, significantly enhancing the performance of adaptive beamforming techniques. The comprehensive analysis of the technical contribution, methodology, and significance to the field highlights its potential to advance audio processing applications in complex environments.
The proposed methodology introduces the Online Segmented Beamformer, which innovatively adapts the integration window for covariance matrix estimation based on the statistical characteristics of the incoming data. This dynamic programming approach allows for real-time tracking of environmental changes, addressing the limitations of traditional fixed-window methods. The algorithm's ability to maintain a balance between bias and variance through temporal segmentation is a significant advancement in adaptive beamforming techniques.
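For context, the fixed-window baseline that the proposed method improves upon amounts to estimating a sample covariance matrix over a block and plugging it into the Capon/MVDR weight formula, as in the sketch below; the diagonal-loading constant is an illustrative choice, and the dynamic-programming segmentation itself is not reproduced here.

```python
import numpy as np

def block_scm(snapshots):
    """Fixed-window SCM estimate: average of outer products over a block.
    snapshots: (num_mics, num_samples) complex STFT frames for one frequency bin."""
    n = snapshots.shape[1]
    return snapshots @ snapshots.conj().T / n

def capon_weights(scm, steering, loading=1e-3):
    """Capon/MVDR weights w = R^{-1} d / (d^H R^{-1} d) from a sample
    covariance matrix; diagonal loading stabilises the inverse."""
    dim = scm.shape[0]
    r = scm + loading * np.trace(scm).real / dim * np.eye(dim)
    rinv_d = np.linalg.solve(r, steering)
    return rinv_d / (steering.conj() @ rinv_d)
```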
The experiments conducted in both simulated and real-world environments demonstrate the proposed method's effectiveness. The use of complex reverberant scenarios and dynamic sources provides a robust testing ground, and the reported performance metrics (SI-SDR, PESQ, STOI) indicate substantial improvements over fixed-window methods. The comprehensive evaluation across various conditions strengthens the validity of the results.
While the paper provides a detailed description of the algorithm and its implementation, the lack of publicly available code or datasets limits reproducibility. Including a demo or project URL would enhance the ability for other researchers to validate and build upon this work.
One limitation is the potential computational complexity associated with maintaining multiple candidate beamformers and the need for efficient real-time processing. Additionally, the algorithm's performance in highly dynamic environments with abrupt changes could be further explored to assess its robustness.
The Online Segmented Beamformer has the potential to significantly impact various applications in audio processing, including speech enhancement, hearing aids, and sonar systems, where dynamic acoustic environments are prevalent. Its ability to adaptively manage interference in real-time could lead to advancements in communication technologies and improve user experiences in noisy settings.
Closed-Set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.
Primary: University Mohammed VI Polytechnic
All Institutions: University Mohammed VI Polytechnic, CID Development
The paper presents TARNet, a novel multi-scale architecture for closed-set speaker identification that effectively models temporal dependencies, achieving state-of-the-art performance while maintaining computational efficiency. The comprehensive evaluation of the methodology, experimental results, and potential applications underscores its significance in the field of audio processing and speaker recognition.
The proposed TARNet architecture introduces a multi-scale temporal encoder that effectively captures speaker-specific characteristics across different temporal scales. The use of dilated convolutions allows for the modeling of temporal dependencies while preserving resolution, which is a significant improvement over traditional CNN architectures. The Attentive Statistics Pooling (ASP) module further enhances the model's ability to focus on discriminative features, making the methodology both innovative and practical for real-world applications.
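A minimal version of attentive statistics pooling is sketched below: an attention network produces per-frame weights, and the weighted mean and standard deviation are concatenated into the utterance-level embedding. The bottleneck size and exact attention form are assumptions; TARNet's own ASP module may differ in detail.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Minimal attentive statistics pooling: attention-weighted mean and
    standard deviation over time, concatenated into a fixed-size embedding."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):                        # x: (batch, channels, time)
        w = torch.softmax(self.attention(x), dim=2)
        mean = (w * x).sum(dim=2)
        var = (w * x ** 2).sum(dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)     # (batch, 2 * channels)

# Example: pool a (batch=2, channels=256, time=300) feature map.
pooled = AttentiveStatsPooling(256)(torch.randn(2, 256, 300))
print(pooled.shape)                              # torch.Size([2, 512])
```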
The experiments conducted on VoxCeleb1 and LibriSpeech datasets demonstrate TARNet's superior performance compared to state-of-the-art models. The results are well-presented, showing a clear advantage in accuracy metrics. The paper also includes ablation studies that validate the importance of each component in the architecture, providing a comprehensive evaluation of the model's effectiveness.
The authors have made the code publicly available, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and training procedures, to facilitate easier replication of results by other researchers.
One limitation of the study is the lack of evaluation in noisy or reverberant conditions, which are common in real-world scenarios. Additionally, while TARNet shows strong performance on the evaluated datasets, its generalizability to other speaker identification tasks or languages remains untested.
The advancements presented in TARNet have significant implications for biometric authentication and forensic analysis, where accurate speaker identification is crucial. The lightweight nature of the model also suggests potential applications in mobile and embedded systems, expanding its usability in various domains.
We introduce Latent Secret Spin (LSS), a blind speech watermarking method based on geometric operations in codec latent space. Using orthogonal rotations in principal-component space, LSS induces imperceptible but detectable covariance signatures according to a pseudo-random watermarking schedule. The scheme generalises across datasets, preserves perceptual quality and, unlike learned neural watermarking schemes, does not require neural network training; it is also resistant to common signal manipulations and flexible with respect to payload size. Analyses show that structured latent-space watermarking is a promising and interpretable alternative to existing approaches.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Latent Secret Spin (LSS), a novel blind speech watermarking method that utilizes geometric operations in latent spaces to achieve robust and imperceptible watermarking. The technical contributions are significant, providing a new perspective on watermarking methodologies that could influence future research and applications in audio content security.
The proposed Latent Secret Spin (LSS) methodology is innovative in its approach to blind speech watermarking by leveraging geometric operations in the latent space of neural audio codecs. The use of orthogonal rotations in principal component space to induce detectable covariance signatures is a novel contribution that distinguishes it from traditional watermarking techniques. The methodology is well-structured, with a clear explanation of the embedding and detection processes, including the use of a pseudo-random schedule for key management. However, while the geometric principles are sound, the reliance on PCA may limit the approach's adaptability to various audio contexts.
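To illustrate the geometric idea, the sketch below embeds a watermark by rotating codec-style latent frames in a plane spanned by two principal components on a keyed, chunk-wise schedule, and reads out a simple covariance statistic in that plane. The PCA basis, rotation angle, chunk length, and detection statistic are illustrative assumptions rather than the paper's actual embedding and detection procedure.

```python
# Hedged sketch of rotation-in-principal-component-space watermarking.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in codec latents: (frames, dim) for one utterance, with anisotropic scale
# so the principal components have distinct variances.
latents = rng.normal(size=(500, 64)) * np.linspace(3.0, 1.0, 64)
mean = latents.mean(axis=0)
_, _, components = np.linalg.svd(latents - mean, full_matrices=False)  # rows = PCs

def rotate_in_pc_plane(z, components, i, j, angle):
    """Rotate centered latent frames by `angle` in the plane of PCs i and j."""
    proj = z @ components.T                      # coordinates in PCA space
    c, s = np.cos(angle), np.sin(angle)
    xi, xj = proj[:, i].copy(), proj[:, j].copy()
    proj[:, i] = c * xi - s * xj
    proj[:, j] = s * xi + c * xj
    return proj @ components                     # back to latent space

# A pseudo-random key decides which 50-frame chunks carry the rotation.
key = np.random.default_rng(1234)                # shared secret seed
watermarked = latents.copy()
for start in range(0, len(latents), 50):
    if key.random() < 0.5:                       # keyed watermarking schedule
        chunk = watermarked[start:start + 50] - mean
        watermarked[start:start + 50] = rotate_in_pc_plane(chunk, components, 0, 1, 0.15) + mean

# Detection statistic: cross-covariance between PC0 and PC1 coordinates, which the
# keyed rotations perturb away from the near-zero value of unmarked latents.
coords = (watermarked - mean) @ components.T
print("PC0/PC1 covariance:", np.cov(coords[:, 0], coords[:, 1])[0, 1])
```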
The experiments conducted to evaluate LSS are robust, utilizing two different speech datasets (VoxPopuli and ASVspoof) to assess both in-domain and out-of-domain performance. The evaluation metrics, including AUC-ROC for detection performance and PESQ for perceptual quality, provide a comprehensive view of the method's effectiveness. The results demonstrate strong detection capabilities across various audio manipulations, indicating good robustness. However, the paper could benefit from a more extensive range of experiments to explore the method's performance under more aggressive adversarial conditions.
The paper includes a link to the source code and sample utterances, which facilitates reproducibility. The detailed description of the experimental setup, including the configuration of the encoder and decoder, as well as the parameters used for watermark embedding and detection, enhances the clarity of the methodology. However, the lack of subjective listening tests for perceptual quality assessment is a drawback that could affect reproducibility in practical applications.
The study acknowledges some limitations, such as the focus on bona fide speech and the evaluation under common, non-malicious manipulations. The method's performance against stronger, adaptive attacks remains untested, and the reliance on objective metrics for perceptual quality could overlook important subjective aspects. Additionally, the distribution of watermarks at the chunk level may be vulnerable to temporal manipulations like splicing.
The implications of LSS are significant, particularly in the context of increasing concerns around misinformation and content authenticity in the audio domain. By providing a robust and imperceptible watermarking solution, LSS could play a crucial role in verifying the provenance of speech recordings and enhancing the security of audio content. The method's flexibility and interpretability also suggest potential applications beyond speech, potentially extending to other forms of media where watermarking is essential.
Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities between countries aligned with geographic and linguistic relationships, validating our approach. Substantial variation emerged in both melodic and rhythmic structures across countries, yet the diversity of the two components was not significantly correlated, challenging assumptions of coupled evolution. Only rhythmic diversity was significantly associated with ethnic and linguistic heterogeneity, while melodic diversity showed no such association. These findings suggest that melody and rhythm constitute partially independent systems shaped by distinct cultural and evolutionary pressures, rather than components of a single monolithic musical style.
Primary: University of Cambridge
All Institutions: University of Cambridge, RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion, University of Oslo, Department of Psychology, Goldsmiths College, University of London, Department of Life Sciences, Leipzig University, Division of Social Science, New York University Abu Dhabi, Department of Psychology, Cornell University
This paper presents a significant advancement in understanding the independent evolution of melody and rhythm across cultures through a novel computational approach. The methodology is innovative and the findings challenge existing assumptions in music theory, providing a fresh perspective on cultural music analysis.
The paper introduces a novel computational pipeline that leverages deep learning source separation techniques to extract melodic and rhythmic features from a large dataset of songs. This approach is innovative as it allows for the analysis of music without relying on traditional, often biased, manual annotations. The methodology is well-detailed, including the use of kernel density estimation for summarizing melodic and rhythmic distributions, and the careful consideration of time scales for analyzing pitch intervals. The choice of using distributional profiles rather than higher-level constructs is a significant strength, as it minimizes analytical biases. The operational definitions of melody and rhythm are clear, although they are somewhat limited in scope.
The experimental design is robust, utilizing a large dataset of 27,628 songs from 59 countries, which provides a comprehensive basis for cross-cultural analysis. The authors validate their computational pipeline by demonstrating that the extracted distributions align with known musical patterns, thus establishing face validity. The use of Jensen-Shannon divergence to assess musical similarity between countries is appropriate and effectively highlights the independence of melodic and rhythmic diversity. However, the paper could benefit from additional metrics or qualitative assessments to further substantiate the findings.
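As a minimal illustration of the country-level comparison, the sketch below turns pitch-interval samples into kernel-density profiles on a shared grid and compares two such profiles with the Jensen-Shannon divergence. The toy data, grid range, and bandwidth defaults are assumptions; only the general recipe (distributional profiles compared via JSD) follows the paper.

```python
# Toy comparison of two pitch-interval distributions via KDE profiles and JSD.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

# Toy pitch-interval samples (in semitones) for two "countries".
intervals_a = rng.normal(loc=0.0, scale=2.0, size=2000)
intervals_b = rng.normal(loc=0.0, scale=3.0, size=2000)

# Kernel density estimates evaluated on a shared grid give comparable profiles.
grid = np.linspace(-12, 12, 241)
profile_a = gaussian_kde(intervals_a)(grid)
profile_b = gaussian_kde(intervals_b)(grid)

# Normalize to probability vectors; jensenshannon returns the square root,
# so squaring gives the divergence itself.
profile_a /= profile_a.sum()
profile_b /= profile_b.sum()
jsd = jensenshannon(profile_a, profile_b) ** 2
print(f"Jensen-Shannon divergence: {jsd:.4f}")
```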
The paper provides sufficient detail regarding the methods and algorithms used, including the specific tools and parameters for source separation and feature extraction. The availability of the code and metadata through the provided GitHub link enhances reproducibility. However, the reliance on proprietary audio data from YouTube may limit the ability of others to fully replicate the study, particularly in regions with less representation.
The authors acknowledge several limitations, including the potential biases introduced by using YouTube chart data, which may not capture traditional or non-commercial music. Additionally, the source separation algorithms are primarily trained on Western music, which could affect the accuracy of the extracted features for non-Western genres. The operational definitions of melody and rhythm are also somewhat narrow, potentially overlooking the complexity of musical interactions.
The findings have significant implications for the fields of music cognition and cultural evolution, suggesting that melody and rhythm are shaped by different cultural and evolutionary pressures. This research could influence how music is studied across disciplines, including anthropology, psychology, and musicology. The methodology could also be applied to other forms of cultural expression, providing insights into the interplay between different artistic components.
Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which approaches MER from the perspective of representation consistency, aiming to enable robust emotion prediction across heterogeneous modality combinations. MCUR incorporates two core components: (1) a Modality Combination-Based and Category-Based Contrastive Learning mechanism (MCB-CL), which encourages samples with the same emotion category and the same available modalities to be close in the representation space; and (2) Sample-wise Uncertainty-Guided Regularization (SUGR), which adaptively assigns uncertainty-based weights to individual samples to optimize training. Extensive experiments demonstrate that MCUR consistently outperforms existing methods, achieving average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.
Primary: University of Electronic Science and Technology of China
All Institutions: University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China
The paper introduces the MCUR framework, which enhances multimodal emotion recognition by promoting representation consistency and addressing uncertainty in predictions. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and human-computer interaction.
The proposed MCUR framework presents a novel approach to multimodal emotion recognition (MER) by focusing on representation consistency across heterogeneous modalities. The integration of Modality Combination-Based and Category-Based Contrastive Learning (MCB-CL) and Sample-wise Uncertainty-Guided Regularization (SUGR) is a significant methodological advancement. MCB-CL enhances the discriminative power of representations by enforcing proximity in the embedding space for samples with the same emotion category and modality combination, while SUGR addresses uncertainty in predictions, allowing for adaptive weighting during training. This dual approach is innovative and effectively tackles the challenges posed by modality heterogeneity.
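A minimal sketch of the positive-pair definition described above is given below: a supervised contrastive loss in which two samples count as positives only when they share both the emotion category and the available-modality combination. The InfoNCE-style loss form, temperature, and embedding dimension are assumptions; the paper's exact objective may differ.

```python
# Hedged sketch of "same emotion AND same modality combination" contrastive positives.
import torch
import torch.nn.functional as F


def mcb_contrastive_loss(z, emotion, combo, tau=0.07):
    """z: (N, d) embeddings; emotion, combo: (N,) integer ids per sample."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                   # pairwise similarities
    # Positives must share both the emotion label and the available-modality combo.
    pos = (emotion[:, None] == emotion[None, :]) & (combo[:, None] == combo[None, :])
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = pos & ~eye                                        # drop self-pairs
    logits = sim.masked_fill(eye, float("-inf"))            # self never in denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_anchor = -torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(dim=1)
    per_anchor = per_anchor / pos.sum(dim=1).clamp(min=1)
    return per_anchor[pos.any(dim=1)].mean()                # anchors with >=1 positive


# Example: 8 samples, 3 emotion classes, 2 modality combinations.
z = torch.randn(8, 128)
emotion = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
combo = torch.tensor([0, 0, 0, 1, 1, 1, 1, 0])
print(mcb_contrastive_loss(z, emotion, combo))
```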
The experiments are comprehensive, utilizing three widely recognized datasets (MOSI, MOSEI, and IEMOCAP) to validate the effectiveness of the MCUR framework. The reported performance improvements over existing state-of-the-art methods, with average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP, demonstrate the robustness of the proposed approach. The ablation studies further substantiate the contributions of each component of the framework, revealing the critical role of both MCB-CL and SUGR in enhancing model performance.
The paper provides detailed implementation details, including the training configurations, hyperparameter settings, and evaluation protocols, which are crucial for reproducibility. The authors also mention the use of official implementations for baseline models, ensuring a fair comparison. However, the lack of publicly available code or demo URLs limits the ease of reproduction for external researchers.
While the MCUR framework shows promising results, the paper does not address the potential computational overhead associated with the added complexity of the proposed methods. Additionally, the performance in real-world noisy conditions is not thoroughly evaluated, which could limit the applicability of the framework in practical scenarios. The reliance on specific datasets may also restrict generalizability to other contexts or domains.
The advancements presented in this paper have significant implications for human-computer interaction, particularly in enhancing emotion recognition systems that can adapt to varying modalities. The ability to maintain consistent representations across different modalities can improve the robustness of applications in areas such as virtual assistants, mental health monitoring, and social robotics. The focus on uncertainty in predictions may also lead to more reliable systems that can better handle real-world variability.
Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural directional filtering and diffuse sound extraction. NDF+ reformulates VDM estimation into two coupled subtasks: dereverberated VDM reconstruction and diffuse sound extraction. This reformulation enables NDF+ to manipulate diffuse components in the final reconstructed VDM output. We evaluated NDF+ under reverberant conditions and compared it with representative conventional baselines. Results show that NDF+ consistently outperforms the baselines on both subtasks, while maintaining VDM reconstruction quality comparable to that of the original single-task NDF model. These findings indicate that NDF+ introduces an additional degree of freedom for diffuse sound control in the VDM reconstruction. In a stereo recording application, NDF+ provides controllable inter-channel level differences between left and right channels by adjusting the estimated diffuse component.
Primary: International Audio Laboratories Erlangen
All Institutions: International Audio Laboratories Erlangen, Fraunhofer IIS, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
The main contribution of this paper is the introduction of NDF+, a joint framework for neural directional filtering and diffuse sound extraction that enhances VDM reconstruction while allowing for effective control of diffuse sound components. This work represents a significant step forward in spatial audio processing, combining innovative methodologies with rigorous experimental validation to address key challenges in the field.
The paper introduces NDF+, a novel framework that combines neural directional filtering with diffuse sound extraction, effectively reformulating the VDM estimation into two coupled subtasks. The methodology employs a dual-mask architecture using LSTM networks to estimate coherent and diffuse components, which is a significant advancement over previous models that focused solely on VDM reconstruction. The approach is well-structured, with a clear explanation of the DNN architecture, training strategy, and loss functions, demonstrating a thoughtful integration of existing techniques with innovative modifications.
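To make the dual-mask idea concrete, the sketch below shows a shared LSTM encoder with two output heads: one mask for the dereverberated coherent (VDM) component and one for the diffuse component. Layer sizes, the sigmoid mask parameterization, and the single-channel input are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of a dual-mask LSTM estimator for coherent and diffuse components.
import torch
import torch.nn as nn


class DualMaskLSTM(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.coherent_head = nn.Linear(hidden, n_freq)   # mask for dereverberated VDM
        self.diffuse_head = nn.Linear(hidden, n_freq)    # mask for diffuse sound

    def forward(self, log_mag):
        # log_mag: (batch, frames, freq) log-magnitude of a reference channel
        h, _ = self.lstm(log_mag)
        return torch.sigmoid(self.coherent_head(h)), torch.sigmoid(self.diffuse_head(h))


# Example: masks for a 100-frame spectrogram; a user-chosen gain on the diffuse
# mask provides the extra degree of freedom for diffuse sound control.
model = DualMaskLSTM()
coh_mask, dif_mask = model(torch.randn(2, 100, 257))
print(coh_mask.shape, dif_mask.shape)
```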
The experimental evaluation is comprehensive, comparing NDF+ against conventional baselines under various reverberant conditions. The results indicate that NDF+ consistently outperforms these baselines on both subtasks while maintaining VDM reconstruction quality. The use of objective metrics such as SDR and PESQ to measure performance adds rigor to the evaluation. However, the paper could benefit from more detailed qualitative assessments, such as user studies or subjective listening tests, to further validate the improvements in audio quality.
The paper provides a detailed description of the experimental setup, including the configurations of the microphone array, training data, and evaluation metrics. However, the absence of a public code repository or demo URL limits the reproducibility of the results. Including such resources would enhance the paper's impact and allow other researchers to validate and build upon the findings.
One limitation is the reliance on simulated environments for training and testing, which may not fully capture the complexities of real-world acoustic scenarios. Additionally, while the paper discusses the performance of NDF+ in stereo recording applications, it does not explore its scalability to larger microphone arrays or more complex sound environments.
The advancements presented in NDF+ have significant implications for spatial audio applications, particularly in enhancing the quality of recordings in reverberant environments. The ability to control diffuse sound components can improve immersive audio experiences in various fields, including virtual reality, telecommunications, and music production. The framework could also inspire further research into joint signal processing techniques in audio applications.
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints on both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $\epsilon = 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.
Primary: Sogang University
All Institutions: Sogang University
The paper presents a significant advancement in audio evaluation metrics by introducing OTAD, which effectively addresses the limitations of existing methods through innovative methodological contributions and rigorous empirical validation.
The proposed methodology introduces a novel Optimal Transport Audio Distance (OTAD) metric that addresses the limitations of existing metrics like Fréchet Audio Distance (FAD) by employing a dual correction mechanism: a learned Riemannian ground-metric adapter for the cost function and entropic Sinkhorn optimal transport for the coupling. This innovative approach allows for a more sensitive detection of artifacts in audio generation, which is crucial for applications in text-to-audio synthesis. The method is theoretically grounded and systematically validated through a comprehensive experimental design, including a factorial decomposition of the contributions from cost and coupling.
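The coupling correction can be illustrated with a plain entropic Sinkhorn solver over two embedding sets, as sketched below. The ground metric here is ordinary squared Euclidean distance rather than the paper's learned Riemannian adapter, and the cost normalization, epsilon, and iteration count are illustrative choices.

```python
# Minimal entropic Sinkhorn sketch with uniform marginals over two embedding sets.
import numpy as np


def sinkhorn_distance(x, y, eps=0.05, n_iter=200):
    """Entropic-regularized OT cost between point clouds x (n, d) and y (m, d)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    cost = cost / cost.max()                      # keep the Gibbs kernel well-scaled
    a = np.full(len(x), 1.0 / len(x))             # uniform source marginal
    b = np.full(len(y), 1.0 / len(y))             # uniform target marginal
    K = np.exp(-cost / eps)                       # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):                       # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]            # discrete transport plan
    return float((plan * cost).sum()), plan


rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 16))             # e.g., embeddings of real audio
generated = rng.normal(loc=0.3, size=(64, 16))    # e.g., embeddings of generated audio
dist, plan = sinkhorn_distance(reference, generated)
# The plan's rows and columns give per-sample mass assignments, which is what
# enables per-sample diagnostics that scalar-aggregated metrics lack.
print(dist, plan.sum())
```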
The experiments are robust, utilizing eight different encoders and a four-axis evaluation protocol to assess the performance of OTAD against FAD and KAD. The results indicate a significant improvement in sensitivity to rank-1 contamination and a higher correlation with human Mean Opinion Scores (MOS). The experiments also include per-sample diagnostics, which provide insights into the specific artifacts present in audio samples, highlighting the practical utility of OTAD in real-world applications.
The paper includes sufficient detail regarding the implementation of the OTAD metric and the experimental setup, including the datasets used (FSD50K and ESC-50) and the training of the adapters. The release of the OTAD toolkit on GitHub further enhances reproducibility, allowing other researchers to replicate the findings and utilize the metric in their own work.
The study acknowledges several limitations, including the reliance on a single listening test for MOS validation and the potential biases introduced by training on a specific dataset (FSD50K). Additionally, the performance of OTAD on music and speech domains remains untested, and the scalability of the method for larger datasets is not fully explored.
The introduction of OTAD has significant implications for the field of audio generation evaluation, providing a more nuanced and sensitive metric that can improve the quality of generated audio. This advancement could lead to better user experiences in applications such as music synthesis, sound design, and audio restoration. The methodology can also serve as a blueprint for future research in audio evaluation metrics across different domains.
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.
Primary: Skolkovo Institute of Science and Technology
All Institutions: Skolkovo Institute of Science and Technology
The main contribution of this paper is the introduction of the PianoCoRe dataset, a refined and comprehensive MIDI dataset that addresses the limitations of existing resources in symbolic music analysis. This work significantly enhances the foundation for future research in expressive piano performance modeling and music information retrieval, showcasing a meticulous approach to dataset curation and quality assessment.
The methodology presented in this paper is robust and comprehensive, detailing a systematic approach to curating and refining a large-scale piano MIDI dataset. The authors employ a multi-tiered strategy that includes deduplication, quality assessment, and note alignment refinement using the RAScoP pipeline. The integration of various existing datasets into a unified collection is particularly noteworthy, as it addresses the inconsistencies and limitations found in previous datasets. The use of a MIDI quality classifier to filter out corrupted transcriptions and the detailed description of the alignment process further enhance the methodological rigor.
The experiments conducted demonstrate the effectiveness of the proposed dataset and methodologies. The authors provide a thorough evaluation of the MIDI quality classifier, achieving a high macro F1 score, which indicates the classifier's reliability in distinguishing between performance qualities. Additionally, the application of the dataset in training an expressive performance rendering model shows significant improvements in robustness, suggesting that the dataset effectively supports advanced modeling tasks. However, specific quantitative results from the expressive performance rendering model could further strengthen the experimental validation.
The paper includes detailed descriptions of the dataset construction process, including data sources, matching methodologies, and quality assessment techniques. The authors provide a GitHub repository link for the project, which enhances reproducibility. However, the paper could benefit from including specific implementation details or code snippets to facilitate replication of the methodologies by other researchers.
One limitation identified is the reliance on existing datasets, which may still contain inherent biases or limitations that could affect the quality of the combined dataset. Additionally, while the RAScoP pipeline improves alignment, the paper does not fully address potential edge cases where alignment might still be problematic. The focus on public domain works may also limit the dataset's applicability to contemporary compositions.
The PianoCoRe dataset has the potential to significantly impact the field of music information retrieval and computational musicology by providing a comprehensive resource for training models in expressive performance rendering and analysis. Its tiered structure allows for diverse applications, from large-scale analysis to specific performance modeling tasks, thus fostering advancements in music generation and understanding.
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.
Primary: MERL
All Institutions: MERL
The paper presents a novel framework that bridges predictive and generative modeling for speech enhancement and separation, demonstrating significant improvements in perceptual quality while maintaining competitive performance on traditional metrics. The comprehensive methodology and robust experimental validation position this work as a meaningful contribution to the field of machine learning in audio processing.
The proposed Stochastic Interpolant Prior for Speech (SIPS) framework effectively integrates predictive and generative modeling approaches, addressing the limitations of both paradigms by introducing a mathematically grounded decomposition of interpolation dynamics. This innovative methodology allows for a flexible and efficient plug-and-play integration with existing predictors, enhancing the perceptual quality of speech enhancement and separation tasks while maintaining fidelity to the original signals.
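As a heavily simplified illustration of the sampling idea, the sketch below combines a deterministic drift toward a predictor's estimate with a score-based denoising correction under an annealed noise level. The interpolant schedule, step rule, and the stand-in score function are placeholder assumptions, not the paper's formulation.

```python
# Toy sketch: predictor-driven drift plus a score-based correction during sampling.
import numpy as np

rng = np.random.default_rng(0)


def score_model(x, sigma):
    """Stand-in for a clean-speech score network; here: the score of N(0, 1) data
    observed under additive noise of level sigma."""
    return -x / (1.0 + sigma ** 2)


def sips_style_sampler(predictor_estimate, n_steps=50, sigma_max=1.0):
    x = predictor_estimate + sigma_max * rng.normal(size=predictor_estimate.shape)
    for k in range(n_steps):
        t = 1.0 - k / n_steps                           # anneal the noise level
        sigma = sigma_max * t
        drift = predictor_estimate - x                  # pull toward the task-consistent estimate
        denoise = (sigma ** 2) * score_model(x, sigma)  # generative correction
        x = x + (drift + denoise) / n_steps
    return x


# Example on a toy "signal": the sampler refines the predictor output.
predictor_estimate = rng.normal(size=128)
refined = sips_style_sampler(predictor_estimate)
print(refined.shape)
```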
The experiments conducted demonstrate the efficacy of SIPS across various tasks, including speech enhancement and separation, using multiple state-of-the-art predictors. The results indicate consistent improvements in non-intrusive perceptual quality metrics, alongside competitive performance in reference-based metrics, showcasing the robustness and versatility of the proposed method.
The paper provides a clear implementation of the proposed method, including detailed descriptions of the experimental setup, data representation, and training procedures. The availability of the implementation on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings.
One limitation is the reliance on clean speech data for training the generative prior, which may affect performance in real-world scenarios with diverse degradation types. Additionally, while the method shows promise, further exploration of its generalization capabilities across different audio domains is warranted.
The SIPS framework has significant implications for various applications in speech processing, including telecommunications, assistive technologies, and audio content creation. By improving speech quality in challenging conditions, this work can enhance user experiences in voice communication systems and contribute to advancements in automatic speech recognition and natural language processing.
Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for a deployment-critical query family. We study answer-preserving audio compression, judging a compressor by the excess answer-error it induces, especially for the worst-affected family. We formulate this theoretically as a compressor acceptance-rejection criterion, derive a practical sign-off protocol that returns compression budgets satisfying worst-family checks with statistical confidence, and evaluate it on five multiple-choice audio question-answering benchmarks with two Qwen-based backbones. The protocol exposes hidden family-level damage, shows that the chosen query-family partition can change the approved budget, and identifies regimes where query-conditioned compression helps maintain answer preservation.
Primary: Technion – Israel Institute of Technology
All Institutions: Technion – Israel Institute of Technology
The main contribution of this paper is the introduction of a framework for task-aware answer-preserving audio compression, which addresses the critical challenge of maintaining answer quality in large audio language models under compression constraints. This work significantly advances the understanding of audio compression impacts on model performance and provides a practical methodology for evaluating and ensuring answer preservation across diverse query families.
The methodology presented in this paper is robust and well-structured. The authors introduce a theoretical framework for task-aware answer-preserving audio compression, which is a novel approach to evaluating audio compression techniques in the context of large audio language models (LALMs). The paper formulates a compressor acceptance-rejection criterion and derives a practical sign-off protocol that incorporates statistical confidence, which is a significant contribution to the field. The approach is grounded in a solid theoretical foundation, linking practical deployment configurations to answer preservation metrics. The use of paired evaluations and the focus on worst-family checks are particularly noteworthy, as they address the critical issue of performance degradation across different query families.
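The flavor of such a sign-off protocol can be sketched as follows: for each query family, compute a high-confidence upper bound on the excess answer-error induced by compression, and approve a budget only if the worst family's bound stays below a tolerance. The Hoeffding-style bound, Bonferroni correction, and thresholds below are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a worst-family acceptance check for a compression budget.
import math


def family_upper_bound(errors_compressed, errors_uncompressed, delta=0.05):
    """One-sided high-confidence upper bound on mean excess error for one family."""
    diffs = [c - u for c, u in zip(errors_compressed, errors_uncompressed)]
    n = len(diffs)
    mean_excess = sum(diffs) / n
    # Paired error differences lie in [-1, 1], so a one-sided Hoeffding bound has
    # width sqrt(2 * ln(1/delta) / n).
    return mean_excess + math.sqrt(2.0 * math.log(1.0 / delta) / n)


def approve_budget(per_family_results, tolerance=0.03, delta=0.05):
    """per_family_results: {family: (errors_compressed, errors_uncompressed)}."""
    bounds = {
        fam: family_upper_bound(comp, unc, delta / len(per_family_results))
        for fam, (comp, unc) in per_family_results.items()      # Bonferroni split
    }
    worst_family = max(bounds, key=bounds.get)
    return bounds[worst_family] <= tolerance, worst_family, bounds


# Example with two hypothetical query families of binary answer errors (1 = wrong).
results = {
    "speaker_id": ([0, 0, 1, 0] * 50, [0, 0, 0, 0] * 50),
    "event_count": ([0, 1, 0, 0] * 50, [0, 1, 0, 0] * 50),
}
print(approve_budget(results))
```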
The experimental evaluation is comprehensive, utilizing five multiple-choice audio question-answering benchmarks. The authors effectively demonstrate the applicability of their framework and the importance of considering family-level performance rather than relying solely on average metrics. The results reveal significant insights into how compression can affect different query families, showcasing the hidden damage that can occur when using average performance metrics. The experiments are well-designed, and the analysis is thorough, providing empirical support for the theoretical claims made in the paper.
The paper provides detailed descriptions of the experimental setup, including the datasets, models, and evaluation metrics used. However, there are some limitations regarding the availability of code and data, as no URLs for project repositories or demo pages are provided. This lack of resources may hinder reproducibility for other researchers looking to validate or build upon the findings.
The paper acknowledges several limitations, including the potential for query-family coarsening and the challenges of estimating true Bayes risks due to calibration errors and prompt sensitivity. Additionally, the framework's applicability to different languages, longer audio clips, or varying deployment scenarios is not fully established, which may limit its generalizability.
The proposed framework has significant implications for the deployment of audio language models in real-world applications, particularly in scenarios where audio compression is necessary for efficiency. By emphasizing the importance of answer preservation across different query families, this work could influence future research and development in audio processing, machine learning, and multimodal systems. The findings could lead to improved audio compression techniques that better maintain the integrity of information critical for specific tasks.
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Shanghai Innovation Institute, Tencent, Independent Researcher, Peking University, Tianjin University, Zhejiang University
WavCube presents a novel approach to unify speech understanding and generation through a compact continuous latent representation. This paper makes a substantial contribution to the field by addressing the compatibility challenges between semantic and acoustic features, demonstrating its effectiveness through rigorous experimentation across multiple benchmarks.
The methodology employed in WavCube is innovative, utilizing a two-stage training scheme that effectively addresses the challenges of integrating semantic and acoustic representations. The first stage compresses high-dimensional SSL features into a compact latent space, while the second stage enriches this latent space with fine-grained acoustic details. This approach is well-justified and systematically tackles the inherent flaws of existing SSL representations, making it a significant contribution to the field.
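A minimal sketch of a Stage-2-style objective is shown below: a waveform reconstruction term that injects acoustic detail plus a semantic anchoring term that keeps the compact latent close to a projection of the original SSL features. Both loss terms and their weighting are assumptions, not WavCube's exact recipe.

```python
# Hedged sketch of a reconstruction + semantic anchoring objective.
import torch
import torch.nn.functional as F


def stage2_loss(reconstructed_wave, target_wave, latent, ssl_projection, weight=0.1):
    """latent: (B, T, d) compact representation; ssl_projection: (B, T, d) target
    obtained by projecting frozen SSL features into the same compact space."""
    recon = F.l1_loss(reconstructed_wave, target_wave)                 # acoustic detail
    anchor = 1.0 - F.cosine_similarity(latent, ssl_projection, dim=-1).mean()
    return recon + weight * anchor                                     # stay near the semantic manifold


# Example with random stand-ins for decoder output, waveform, and features.
loss = stage2_loss(
    reconstructed_wave=torch.randn(2, 16000),
    target_wave=torch.randn(2, 16000),
    latent=torch.randn(2, 100, 96),
    ssl_projection=torch.randn(2, 100, 96),
)
print(loss)
```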
The experiments conducted are comprehensive and well-structured, demonstrating WavCube's performance across various tasks, including speech understanding, reconstruction, and generation. The results show that WavCube achieves competitive performance against existing methods, indicating its effectiveness and robustness. The use of benchmarks like SUPERB and the detailed evaluation metrics further enhance the credibility of the findings.
The paper provides sufficient details regarding the methodology and experimental setup, including the datasets and training configurations used. However, the lack of a demo or interactive component may hinder some aspects of reproducibility for practitioners who wish to implement the model.
While the paper presents a strong framework, it does not explicitly discuss potential limitations or assumptions underlying the proposed approach. For instance, the performance drop due to dimensionality reduction and the reliance on specific datasets could be areas of concern that warrant further exploration.
The implications of WavCube are significant, as it offers a unified framework for speech processing that could enhance applications in voice synthesis, speech recognition, and multimodal interactions. By bridging the gap between understanding and generation, WavCube could pave the way for more integrated and efficient speech technologies.
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
Primary: Zhejiang University
All Institutions: Zhejiang University, Beijing Haitian Ruisheng Science Technology Ltd, Center for Language and Speech Processing, Fudan University, Geely Automobile Research Institute (Ningbo) Company Ltd, MoE Key Lab of Artificial Intelligence, Shanghai Innovation Institute, Shanghai Jiao Tong University, X-LANCE Lab
The paper presents X-Voice, a novel multilingual zero-shot voice cloning model that significantly advances the capabilities of TTS systems across 30 languages. The methodology, which includes a two-stage training process and innovative architectural enhancements, addresses critical limitations in existing systems, making it a valuable contribution to the field of machine learning and audio processing.
The paper introduces a two-stage training paradigm for zero-shot voice cloning, which is a significant advancement in the field. The first stage focuses on building a robust multilingual backbone using a large corpus, while the second stage fine-tunes the model using synthetic audio prompts without the need for reference transcripts. This approach effectively addresses the challenges of multilingual TTS systems, particularly the reliance on aligned text and audio, which is often problematic for low-resource languages. The introduction of dual-level language injection and decoupled classifier-free guidance further enhances the model's ability to maintain speaker identity and prosodic accuracy across languages.
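The decoupling idea can be sketched generically as classifier-free guidance with separate scales for the text condition and the audio-prompt (speaker) condition, as below. The combination rule and the scheduling note describe the general technique under stated assumptions, not X-Voice's exact implementation.

```python
# Hedged sketch of decoupled classifier-free guidance with two guidance scales.
import torch


def decoupled_cfg(v_uncond, v_text, v_speaker, w_text=2.0, w_speaker=1.0):
    """Each v_* is the flow-matching velocity predicted under that conditioning."""
    return (
        v_uncond
        + w_text * (v_text - v_uncond)          # push toward the text content
        + w_speaker * (v_speaker - v_uncond)    # push toward the prompt's timbre
    )


# Example with toy velocity fields; in practice the two scales can be scheduled
# over sampling steps (e.g., stronger text guidance early, speaker guidance late).
v_u, v_t, v_s = torch.randn(3, 1, 80, 100).unbind(0)
print(decoupled_cfg(v_u, v_t, v_s).shape)
```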
The experimental results are comprehensive, comparing X-Voice against several state-of-the-art models across multiple languages. The use of both subjective and objective evaluation metrics, including WER, SIM-o, IMOS, and SMOS, provides a well-rounded assessment of the model's performance. The results indicate that X-Voice achieves competitive performance, particularly in low-resource languages, while also demonstrating improvements in intelligibility and speaker consistency compared to existing systems. The release of a new evaluation benchmark with human annotations adds significant value to the research community.
The paper provides detailed implementation details, including model configurations, training setups, and evaluation protocols, which enhances reproducibility. The authors have also open-sourced their training corpus and evaluation benchmarks, fostering transparency and allowing other researchers to build upon their work.
Despite its strengths, the model still faces challenges in preserving speaker similarity in certain phonological contexts, indicating a trade-off between accent suppression and timbre preservation. Additionally, the handling of intra-sentential code-switching is noted as an area for future improvement. The reliance on high-quality synthetic data in the fine-tuning stage may also limit the model's applicability in scenarios where such data is not available.
The advancements presented in this paper have the potential to democratize high-fidelity TTS technology, making it accessible for a wider range of languages, including low-resource ones. The implications extend to various applications, such as personalized voice assistants, language learning tools, and accessibility technologies for individuals with speech impairments. The open-sourcing of resources could significantly accelerate research in multilingual TTS systems and contribute to the development of more inclusive technologies.
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
Primary: Potomac Quantum
All Institutions: Potomac Quantum, United International University, University of Maryland, Monash University, University of the Sunshine Coast
The paper presents Q-Patch, a quantum feature-mapping framework for audio spoofing detection that effectively utilizes time-frequency structures in spectrograms. This innovative approach, combined with rigorous experimental validation, positions the work as a meaningful contribution to the field of audio deepfake detection and quantum machine learning.
The methodology introduces Q-Patch, a novel quantum feature mapping framework specifically designed for audio spoofing detection. It effectively utilizes local time-frequency patches from mel-spectrograms, which is a significant improvement over treating spectrograms as generic images. The use of shallow, hardware-efficient quantum circuits with adjacency-aware entanglement is innovative, as it addresses practical constraints of near-term quantum computing. The approach to summarize patches into compact four-dimensional descriptors before quantum embedding is well thought out, allowing for efficient processing while maintaining relevant information.
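To make the kernel construction concrete, the sketch below simulates a shallow four-qubit feature map (an angle-encoding layer, nearest-neighbour entanglement, and a second rotation layer) with a plain statevector, and defines the kernel as the squared fidelity between two encoded patches. The gate choices and feature rescaling are illustrative assumptions, not the paper's exact circuit.

```python
# Statevector simulation of a shallow 4-qubit feature-map kernel (illustrative gates).
import numpy as np

N_QUBITS = 4


def ry(theta):
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])


def apply_single_qubit(state, gate, qubit):
    ops = [np.eye(2)] * N_QUBITS
    ops[qubit] = gate
    full = ops[0]
    for op in ops[1:]:
        full = np.kron(full, op)                  # qubit 0 is the leftmost factor
    return full @ state


def apply_cnot(state, control, target):
    new = state.copy()
    for basis in range(2 ** N_QUBITS):
        c_bit = (basis >> (N_QUBITS - 1 - control)) & 1
        t_bit = (basis >> (N_QUBITS - 1 - target)) & 1
        if c_bit == 1 and t_bit == 0:             # swap amplitudes of the target-bit pair
            partner = basis ^ (1 << (N_QUBITS - 1 - target))
            new[basis], new[partner] = state[partner], state[basis]
    return new


def patch_state(descriptor):
    """descriptor: four patch features already rescaled to rotation angles."""
    state = np.zeros(2 ** N_QUBITS)
    state[0] = 1.0
    for q, theta in enumerate(descriptor):        # layer 1: angle encoding
        state = apply_single_qubit(state, ry(theta), q)
    for q in range(N_QUBITS - 1):                 # layer 2: adjacency-aware entanglement
        state = apply_cnot(state, q, q + 1)
    for q, theta in enumerate(descriptor):        # layer 3: shallow re-encoding
        state = apply_single_qubit(state, ry(theta / 2.0), q)
    return state


def quantum_kernel(x, y):
    return float(np.abs(patch_state(x) @ patch_state(y)) ** 2)


# Example: two descriptors (e.g., patch energy, contrast, flux, flatness) as angles.
a = np.array([0.3, 1.1, 2.0, 0.7])
b = np.array([0.4, 1.0, 1.9, 0.8])
print(quantum_kernel(a, a), quantum_kernel(a, b))   # ~1.0 and a value below 1
```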
The experimental evaluation is conducted on a balanced dataset derived from LJ Speech, which includes both bona fide and spoofed audio samples. The results indicate that Q-Patch outperforms classical baselines, achieving an AUROC of 0.87 compared to 0.82 for RBF-SVM. The analysis of kernel-space structure further supports the effectiveness of the proposed method, showing clear class separability. However, the limited dataset size (100 samples) raises concerns about the generalizability of the results, which should be addressed in future work.
The paper provides a detailed description of the methodology, including data preparation, feature extraction, and quantum embedding processes. However, the absence of code or a project URL limits the reproducibility of the results. Future work should include sharing the implementation details or code to enable other researchers to replicate the findings.
The study's limitations include the small dataset size, which may not capture the full diversity of real-world audio spoofing attacks. The controlled nature of the spoof generation (using additive noise and spectral distortions) may not reflect the complexities of actual spoofing methods. Additionally, the results are based on ideal quantum simulations, which may not translate directly to performance on physical quantum hardware.
The proposed Q-Patch framework has the potential to significantly impact the field of audio deepfake detection by introducing quantum machine learning techniques that leverage time-frequency structures. This could lead to more robust detection methods that are particularly useful in low-resource settings. As quantum computing technology advances, the framework may become increasingly applicable in real-world scenarios, enhancing security against audio spoofing.
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs - including members from the Gemini and GPT families - across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audio-native and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions regarding latency, cost, and reasoning depth.
Primary: Google
All Institutions: Google USA & Germany
The main contribution of this paper is the rigorous empirical evaluation of leading audio-native LLMs on the MSEB, providing valuable insights into their performance and the challenges of achieving audio-text parity. The comprehensive analysis of methodologies and results positions this work as a significant step forward in the integration of audio processing within the framework of large language models, addressing both theoretical and practical aspects of the field.
The paper presents a comprehensive methodology for applying large language models (LLMs) to the Massive Sound Embedding Benchmark (MSEB), detailing a systematic approach to task-specific prompting and evaluation across diverse audio tasks. The methodology is well-structured, with clear definitions of tasks, input/output formats, and considerations for model performance. The iterative refinement of prompt templates through interactions with models like Gemini 3 demonstrates a thoughtful approach to optimizing LLMs for audio tasks. However, the paper could benefit from a more detailed discussion of the limitations of the chosen methodologies, particularly regarding the adaptability of LLMs to non-generative tasks.
The experimental evaluation is robust, covering a wide range of models and tasks, with detailed performance metrics provided for each evaluation. The use of a diverse set of datasets, including multilingual and varied acoustic environments, enhances the reliability of the results. The paper effectively compares audio-native LLMs with traditional cascaded systems, providing insights into their relative strengths and weaknesses. However, the analysis of results could be improved by including more visual aids (e.g., graphs) to illustrate performance trends across tasks and models.
The paper mentions the open-source nature of the MSEB toolkit and provides a link to the GitHub repository, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings, training protocols, and the exact versions of models used, which could hinder full reproducibility for other researchers.
The paper acknowledges the significant modality gap that persists between audio and text processing, which is a critical limitation. Additionally, the authors note the challenges in achieving consistent performance across different locales and acoustic conditions, indicating that the models may not generalize well in real-world applications. The potential for test data contamination is also a significant concern that could skew results.
The findings of this research have significant implications for the development of audio processing systems, particularly in enhancing the capabilities of LLMs in understanding and reasoning about audio data. The establishment of the MSEB as a benchmark could drive further research and innovation in the field, promoting the development of more robust and versatile audio-native models. The open-source nature of the toolkit encourages community engagement and collaboration, which could accelerate advancements in auditory intelligence.
This study presents a bio-inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). Recent state-of-the-art methods often fail to resolve dense low-frequency harmonic structures in vessel propulsion signals under high-noise conditions; the proposed framework addresses this with a biologically inspired Gammatone filter bank that emulates the cochlea's nonlinear frequency selectivity. By distributing filters according to the Equivalent Rectangular Bandwidth (ERB) scale, the framework achieves a high-fidelity representation of engine-radiated tonals while effectively suppressing isotropic ambient interference. The resulting Cochleagram features are processed by a lightweight, custom-designed Convolutional Neural Network (CNN) that leverages large receptive fields to integrate spectral-temporal continuities. Experimental results on the VTUAD dataset demonstrate a state-of-the-art classification accuracy of 98.41%, outperforming Continuous Wavelet Transform and Mel Frequency Cepstral Coefficient baselines by 3.5% and 7.7%, respectively. Furthermore, the framework achieves an inference latency of only 0.77 ms and a Cohen's Kappa score of 0.971, validating its efficacy for real-time deployment on autonomous, low-power sonar hardware.
Primary: Centre for Applied Research in Electronics (CARE)
All Institutions: Central Research Laboratory, Bharat Electronics Limited, Ghaziabad, India, Centre for Applied Research in Electronics (CARE), IIT Delhi, India
The main contribution of this paper is the development of a bio-inspired Gammatone-CNN framework for underwater acoustic target classification, achieving state-of-the-art performance through innovative feature extraction techniques. This research significantly advances the field of underwater acoustics by providing a method that combines biological principles with modern machine learning, demonstrating the potential for improved classification accuracy in challenging acoustic environments.
The paper introduces a novel bio-inspired Gammatone filter bank for underwater acoustic target classification, leveraging the non-linear frequency selectivity of the cochlea. The methodology prioritizes feature extraction over architectural complexity, employing a lightweight CNN that effectively integrates spectral-temporal features. The use of the Equivalent Rectangular Bandwidth (ERB) scale for filter distribution is particularly innovative, allowing for high fidelity in low-frequency representation, which is crucial for underwater acoustics. The mathematical foundations of the Gammatone filter and the detailed description of the Cochleagram formation process provide a solid basis for the proposed approach.
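To make the front-end concrete, the following is a minimal sketch (not the authors' implementation) of an ERB-spaced gammatone filterbank and a simple cochleagram computation; the filter count, frame length, and frame-energy smoothing are illustrative assumptions.

```python
import numpy as np

def erb_space(f_low, f_high, n_filters):
    """Center frequencies evenly spaced on the Glasberg-Moore ERB-rate scale."""
    def erb_rate(f):
        return 21.4 * np.log10(4.37e-3 * f + 1.0)
    def inv_erb(e):
        return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv_erb(np.linspace(erb_rate(f_low), erb_rate(f_high), n_filters))

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """4th-order gammatone impulse response with ERB-matched bandwidth."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37e-3 * fc + 1.0)                # equivalent rectangular bandwidth (Hz)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, n_filters=64, f_low=50.0, f_high=None, frame=0.025, hop=0.010):
    """Filter -> half-wave rectify -> frame-wise RMS energy, one row per ERB channel."""
    f_high = f_high or 0.9 * fs / 2
    frame_len, hop_len = int(frame * fs), int(hop * fs)
    rows = []
    for fc in erb_space(f_low, f_high, n_filters):
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        y = np.maximum(y, 0.0)                       # half-wave rectification
        n_frames = 1 + (len(y) - frame_len) // hop_len
        rows.append([np.sqrt(np.mean(y[i * hop_len:i * hop_len + frame_len] ** 2))
                     for i in range(n_frames)])
    return np.asarray(rows)                          # shape: (n_filters, n_frames)

# Example: cochleagram of one second of synthetic "tonal + noise" input
fs = 16000
x = np.sin(2 * np.pi * 120 * np.arange(fs) / fs) + 0.1 * np.random.randn(fs)
C = cochleagram(x, fs)
```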
The experimental validation is robust, utilizing the VTUAD dataset to demonstrate the framework's effectiveness. Achieving a classification accuracy of 98.41% and a Cohen's Kappa score of 0.971 indicates strong performance and reliability. The comparative analysis against established methods like Continuous Wavelet Transform and Mel Frequency Cepstral Coefficients shows significant improvements, reinforcing the proposed method's superiority. The inclusion of diverse metrics such as ROC curves and confusion matrices adds depth to the evaluation.
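Since Cohen's Kappa is reported alongside raw accuracy, a short worked example of how the score is computed from a confusion matrix may be useful; the matrix below is hypothetical, not taken from the paper.

```python
import numpy as np

def cohen_kappa(conf):
    """Cohen's kappa from a square confusion matrix (rows = true, cols = predicted)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    p_o = np.trace(conf) / n                                       # observed agreement
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2     # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical 3-class confusion matrix (not the paper's actual results)
kappa = cohen_kappa([[95, 3, 2],
                     [4, 90, 6],
                     [1, 5, 94]])   # ~0.90
```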
The paper provides detailed information on the experimental setup, including dataset partitioning, feature extraction parameters, and model architecture. However, the absence of a publicly available code repository limits reproducibility. Future work should consider sharing implementation details to facilitate validation by other researchers.
While the proposed framework shows remarkable performance, it may struggle with class imbalance, particularly for underrepresented classes like Passengership. The reliance on a specific dataset (VTUAD) may also limit generalizability to other underwater environments. Additionally, the computational efficiency on standard CPUs, while acceptable, could be a concern for real-time applications in more constrained environments.
The implications of this research extend to maritime security, ecological monitoring, and autonomous underwater vehicles (AUVs). By improving underwater target recognition, the framework can enhance surveillance capabilities and contribute to the protection of marine ecosystems. The low-power, real-time processing capabilities make it suitable for deployment in resource-constrained environments.
The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction-following data preparation pipeline incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without the need for task-specific retraining.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Laboratory of Artificial Intelligence, AI Institute
The main contribution of this paper is the introduction of JASTIN, a novel instruction-driven framework for zero-shot audio evaluation that significantly enhances the evaluation process by integrating multimodal LLMs with advanced audio processing techniques. This work represents a meaningful advancement in the field of audio evaluation, addressing critical challenges and setting a new standard for future research.
The proposed JASTIN framework innovatively integrates a frozen high-performance audio encoder with a fine-tuned LLM backbone through a trainable audio adapter, addressing the limitations of existing evaluation metrics by employing a self-instructed reasoning paradigm. The comprehensive data preparation pipeline, which includes multi-source, multi-task, multi-calibration, and multi-description strategies, enhances the model's zero-shot generalization capabilities, making it adaptable to various audio evaluation tasks without the need for task-specific retraining.
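A rough PyTorch-style sketch of the general frozen-encoder / trainable-adapter / LLM pattern described above is given below; the module names, dimensions, and pooling strategy are assumptions for illustration and do not reproduce JASTIN's actual architecture.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Trainable bridge that maps frozen audio-encoder features into the LLM embedding space."""
    def __init__(self, audio_dim=1024, llm_dim=4096, n_query=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Downsample variable-length audio features to a fixed number of "audio tokens"
        self.pool = nn.AdaptiveAvgPool1d(n_query)

    def forward(self, audio_feats):                  # (batch, time, audio_dim)
        pooled = self.pool(audio_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                     # (batch, n_query, llm_dim)

# Only the adapter (and, per the paper, the LLM backbone via fine-tuning) receives
# gradients; the audio encoder stays frozen.
audio_encoder = nn.Identity()        # placeholder for a frozen pretrained encoder
for p in audio_encoder.parameters():
    p.requires_grad = False

adapter = AudioAdapter()
feats = torch.randn(2, 500, 1024)    # dummy encoder output: 2 clips, 500 frames
audio_tokens = adapter(feats)        # would be prepended to the instruction's text embeddings
```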
The experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings across diverse audio domains, including speech, sound, and music. The framework consistently outperforms both traditional metrics and general MLLMs, showcasing its robustness and effectiveness in real-world applications. The evaluation on out-of-domain tasks further emphasizes its generalization capabilities, which is a significant advancement in the field.
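For reference, the correlation metrics used here are the standard Pearson (PLCC) and Spearman (SRCC) correlations between model scores and human ratings, computable directly with scipy; the numbers below are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical model scores vs. human MOS ratings for five clips (not real data)
model_scores = np.array([3.1, 4.2, 2.5, 3.8, 4.6])
human_mos    = np.array([3.0, 4.5, 2.2, 3.6, 4.8])

plcc, _ = pearsonr(model_scores, human_mos)    # linear correlation
srcc, _ = spearmanr(model_scores, human_mos)   # rank correlation
```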
The authors have provided detailed implementation information, including training configurations, data preparation methods, and evaluation metrics, which enhances the reproducibility of their results. However, the lack of a demo URL limits immediate accessibility for other researchers to test the framework.
While the paper presents a comprehensive framework, it does not address potential biases in the training data or the limitations of the LLMs used. Additionally, the model's performance on highly specialized audio tasks may still require further validation.
The JASTIN framework has the potential to revolutionize audio evaluation methodologies by providing a more flexible and generalizable approach. Its implications extend to various applications in audio synthesis, music generation, and speech processing, enabling more efficient and scalable evaluation processes in these domains.
While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, a neural network designed to generate virtual microphone (VM) signals from a limited set of real microphone (RM) measurements. Moreover, we introduce the Spatial Audio Representation Learning (SARL) framework, which leverages estimated VM signals and features to condition a downstream speech enhancement system. Experimental results demonstrate that the proposed framework outperforms existing spatial upsampling baselines across various speech extraction systems, including end-to-end multichannel speech enhancement and neural beamforming. The proposed method nearly recovers the oracle performance achieved when all microphones are available.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST), Meta Reality Labs Research
The paper introduces Spatial-Magnifier, a neural network for spatial upsampling in multichannel speech enhancement, and the SARL framework, significantly enhancing downstream speech processing tasks. The innovative approach and rigorous experimental validation position this work as a valuable contribution to the field of audio signal processing and machine learning.
The paper presents a novel neural network architecture, Spatial-Magnifier, which effectively generates virtual microphone signals from real microphone measurements. It introduces the Spatial Audio Representation Learning (SARL) framework, which enhances the conditioning of downstream speech enhancement tasks by leveraging both estimated virtual microphone signals and features. The use of a GAN-based approach and the incorporation of selection and dynamic channel allocation modules are innovative aspects that contribute to the flexibility and efficiency of the model. The methodology is well-structured, with clear definitions and a logical flow from problem identification to proposed solutions.
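The following sketch illustrates, under assumed shapes and layer choices, the core idea of predicting virtual-microphone channels from a small set of real channels and concatenating both for a downstream enhancer; it is not the paper's GAN-based architecture and omits the selection and dynamic channel allocation modules.

```python
import torch
import torch.nn as nn

class VirtualMicGenerator(nn.Module):
    """Illustrative generator: predicts V virtual-mic waveforms from R real-mic waveforms."""
    def __init__(self, n_real=2, n_virtual=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_real, hidden, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(hidden, n_virtual, kernel_size=15, padding=7),
        )

    def forward(self, real_mics):                    # (batch, n_real, samples)
        return self.net(real_mics)                   # (batch, n_virtual, samples)

vm_gen = VirtualMicGenerator()
real = torch.randn(1, 2, 16000)                      # 1 s of 2-channel capture at 16 kHz
virtual = vm_gen(real)
# A downstream enhancer or neural beamformer would then be conditioned on the
# concatenated real + estimated virtual channels (the SARL idea, at a high level).
enhancer_input = torch.cat([real, virtual], dim=1)   # (1, 8, 16000)
```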
The experiments are comprehensive, utilizing a well-defined dataset and a robust experimental setup to evaluate the performance of the proposed methods. The authors conduct ablation studies and comparisons with existing baselines, demonstrating the effectiveness of their approach across different configurations and tasks. The results indicate significant improvements in performance metrics such as SI-SDR, SNR, PESQ, and STOI, showcasing the technical superiority of the proposed methods over traditional approaches.
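For reference, SI-SDR, one of the headline metrics, can be computed as follows; this is the standard definition rather than anything specific to the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference, then
    compare the target component's energy to the residual's."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# Toy check: a lightly perturbed copy of the reference scores well above 0 dB
ref = np.random.randn(16000)
est = ref + 0.05 * np.random.randn(16000)
print(round(si_sdr(est, ref), 1))
```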
The paper provides sufficient details regarding the experimental setup, including the architecture parameters, training procedures, and evaluation metrics. However, the absence of a publicly accessible code repository or demo URL limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate validation by the research community.
One limitation is the reliance on simulated data for training and evaluation, which may not fully capture the complexities of real-world environments. Additionally, while the proposed methods show promise, the performance in highly dynamic or noisy environments remains to be thoroughly evaluated. The computational efficiency, while improved, could still be a concern for deployment on resource-constrained devices.
The proposed methods have significant implications for real-world applications in speech enhancement, particularly in consumer electronics such as AR glasses and hearing aids. By enabling effective multichannel speech enhancement with fewer microphones, the work addresses a critical need for improved audio capture in compact devices. The advancements in spatial audio processing could also benefit various fields, including telecommunications, virtual reality, and assistive technologies.
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resolution (DCR), a unified framework that learns when to fuse and when to drop modalities. Path I (Affective Fusion Distiller, AFD) performs reverse distillation from audio/visual teachers to a textual student using temporally weighted class evidence, thereby enhancing representation-level calibration and improving fusion when alignment is beneficial. Path II (Affective Discernment Agent, ADA) formulates MER as a contextual bandit that selects among fusion and unimodal predictions based on a dual-view state and a calibration-aware reward, enabling decision-level arbitration under irreconcilable conflicts without requiring per-modality reliability labels. By taking into account the full multimodal context and coupling soft calibration with hard arbitration, DCR reconciles conflicts that can be aligned while bypassing misleading modalities when fusion is harmful. Across five benchmarks covering both dialogue-level and clip-level MER, DCR consistently outperforms competitive baselines or achieves highly competitive results. Further ablations, conflict-specific subset evaluation, and modality-selection analysis verify that AFD and ADA are complementary and jointly improve robust conflict-aware emotion recognition.
Primary: Hefei University of Technology
All Institutions: Hefei University of Technology, Singapore Management University, Nanyang Technological University, MIT Media Lab
The main contribution of this paper is the introduction of the Dual-Path Conflict Resolution framework, which innovatively addresses modality conflicts in multimodal emotion recognition by employing a dual-path approach that distinguishes between benign and severe conflicts. This comprehensive analysis highlights the technical contributions, methodological rigor, and potential impact of the research on the field of affective computing.
The proposed Dual-Path Conflict Resolution (DCR) framework is a significant advancement in multimodal emotion recognition (MER). It effectively distinguishes between benign and severe modality conflicts, employing two distinct paths (AFD and ADA) to handle these conflicts appropriately. AFD utilizes knowledge distillation to enhance textual representations with non-verbal cues, while ADA employs a contextual bandit approach for decision-level arbitration. This dual-path strategy is innovative as it shifts the focus from traditional fusion methods that may amplify errors to a more nuanced conflict-aware approach. The methodology is well-structured, with clear definitions of conflict types and a comprehensive explanation of how each path operates.
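The decision-level arbitration in ADA can be illustrated with a toy epsilon-greedy contextual bandit that chooses among fusion and unimodal prediction heads; the state features, reward definition, and learning rule below are deliberate simplifications of the paper's dual-view state and calibration-aware reward.

```python
import numpy as np

class EpsilonGreedyArbiter:
    """Toy contextual bandit: picks which prediction head to trust for each sample."""
    def __init__(self, n_arms, state_dim, lr=0.05, eps=0.1):
        self.W = np.zeros((n_arms, state_dim))       # one linear value estimator per arm
        self.lr, self.eps, self.n_arms = lr, eps, n_arms

    def select(self, state):
        if np.random.rand() < self.eps:
            return np.random.randint(self.n_arms)
        return int(np.argmax(self.W @ state))

    def update(self, arm, state, reward):
        # Move the chosen arm's value estimate toward the observed reward
        self.W[arm] += self.lr * (reward - self.W[arm] @ state) * state

arms = ["fusion", "text_only", "audio_only", "vision_only"]
arbiter = EpsilonGreedyArbiter(n_arms=len(arms), state_dim=8)

for _ in range(1000):                                # toy training loop with random data
    state = np.random.randn(8)                       # e.g., per-head confidences + agreement stats
    arm = arbiter.select(state)
    reward = float(np.random.rand() < 0.5)           # stand-in for "chosen head was correct"
    arbiter.update(arm, state, reward)
```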
The experimental evaluation is robust, covering five diverse benchmarks that include both dialogue-level and clip-level datasets. The results consistently demonstrate that DCR outperforms competitive baselines, indicating its effectiveness across different contexts. The paper includes detailed ablation studies that validate the contributions of each component within the DCR framework, further strengthening the findings. The use of multiple evaluation metrics enhances the reliability of the results.
The paper provides sufficient implementation details, including the architecture, training protocols, and datasets used. However, the absence of a public demo or detailed code repository at the time of review limits reproducibility. The authors mention that the source code and models will be released, which is a positive step towards enhancing reproducibility.
One limitation of the study is the reliance on heuristic approximations for defining conflict severity, which may not capture the full complexity of modality interactions in real-world scenarios. Additionally, while the framework shows strong performance, its effectiveness in highly nuanced or ambiguous emotional contexts remains to be fully explored.
The DCR framework has significant implications for various applications, including human-computer interaction, healthcare, and robotics, where accurate emotion recognition is crucial. By addressing modality conflicts more effectively, this work could lead to more reliable affective computing systems that better understand human emotions.
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we introduce a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage-aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text-conditioned audio generation and audio-conditioned super-resolution. Across both settings, the proposed stage-aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage-dependent components rather than fixed ingredients.
Primary: China Pharmaceutical University
All Institutions: China Pharmaceutical University, University of Science and Technology of China
The paper presents a novel stage-adaptive framework for audio diffusion modeling, significantly enhancing training efficiency and model performance. The comprehensive methodology and experimental validation contribute valuable insights to the field, although concerns regarding reproducibility and the need for broader applicability remain.
The paper introduces a stage-aware perspective on audio diffusion training, which is a significant methodological innovation. The authors propose three complementary mechanisms—decayed SSL guidance, self-adaptive timestep sampling, and structure-aware regularization—each designed to adapt the training process based on the evolving needs of the model. This approach is well-justified and supported by a clear theoretical framework, utilizing a regime variable to monitor semantic progress. The proposed methods are distinct from traditional static optimization techniques, marking a notable advancement in the field of audio diffusion modeling.
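A schematic sketch of the progress-based regime idea follows: estimate the recent slope of an SSL-space discrepancy curve, map it to a [0, 1] regime value, and use that value to skew timestep sampling and decay SSL guidance. The specific mappings and constants are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

def regime_from_slope(ssl_discrepancy_history, window=50, scale=1e-3):
    """Map the recent slope of an SSL-space discrepancy curve to a [0, 1] regime value.

    Steep negative slope (fast semantic progress) -> regime near 0 (early phase);
    flat slope (progress saturating)              -> regime near 1 (late phase).
    The `scale` constant is arbitrary and only sets where the transition happens.
    """
    hist = np.asarray(ssl_discrepancy_history[-window:])
    if len(hist) < 2:
        return 0.0
    slope = np.polyfit(np.arange(len(hist)), hist, 1)[0]
    return float(np.clip(1.0 - np.abs(slope) / (np.abs(slope) + scale), 0.0, 1.0))

def sample_timesteps(batch, n_steps, regime):
    """Skew timestep sampling with the regime: early training favors high-noise (coarse)
    timesteps, late training favors low-noise (refinement) timesteps."""
    u = np.random.beta(a=1.0 + 2.0 * (1.0 - regime), b=1.0 + 2.0 * regime, size=batch)
    return (u * (n_steps - 1)).astype(int)

def ssl_guidance_weight(regime, w_max=1.0):
    """Decay external SSL guidance as semantic progress saturates."""
    return w_max * (1.0 - regime)

# Toy usage: a shrinking discrepancy curve drives the regime from ~0 toward ~1
history = list(np.exp(-np.arange(200) / 40.0))
r = regime_from_slope(history)
t_batch = sample_timesteps(batch=16, n_steps=1000, regime=r)
w_ssl = ssl_guidance_weight(r)
```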
The experiments are comprehensive, evaluating the proposed methods on both text-conditioned audio generation and audio-conditioned super-resolution. The use of multiple metrics (e.g., FAD, KL divergence, and spectral reconstruction metrics) provides a robust assessment of performance improvements. The results consistently demonstrate that the stage-aware mechanisms outperform static baselines, highlighting their effectiveness. However, the paper could benefit from additional experiments to further validate the findings across diverse datasets and conditions.
The paper lacks explicit details regarding the implementation and availability of code or datasets, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a project URL or demo limits the ability of other researchers to replicate the results or build upon the work.
One limitation is the reliance on a single frozen SSL encoder, which may restrict the generalizability of the findings. Additionally, while the results show improvements in convergence and quality metrics, the paper does not sufficiently address the computational overhead introduced by the proposed mechanisms. The authors also acknowledge that gains in certain metrics (e.g., SISNR) were less pronounced, suggesting that the approach may not uniformly enhance all aspects of audio quality.
The findings have significant implications for the development of efficient audio generation systems, particularly in applications requiring high-quality audio synthesis and restoration. By demonstrating that training efficiency can be improved through a stage-aware approach, this work may influence future research directions in generative modeling and audio processing. The insights gained could also be applicable to other domains where dynamic adaptation of training strategies is beneficial.
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at https://github.com/pymaster17/VocalParse.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Nanyang Technological University, Tianjin University, Ant Group, Zhejiang University
The main contribution of this paper is the development of VocalParse, a unified and scalable singing voice transcription framework that effectively integrates lyrics and melody transcription using advanced prompting strategies and a novel data collection pipeline. This work represents a significant step forward in addressing the challenges of automatic singing voice transcription, with implications for both academic research and practical applications in music technology.
The paper introduces VocalParse, a unified singing voice transcription model leveraging a Large Audio Language Model (LALM). The methodology is innovative, particularly with the interleaved prompting formulation that integrates lyrics and melody in a structured manner, addressing the challenges of traditional multi-stage pipelines. The Chain-of-Thought (CoT) prompting strategy is a significant advancement, allowing for better semantic continuity in the transcription process. The introduction of the SingCrawl data pipeline for large-scale data collection is also a noteworthy contribution, enhancing the model's training data quality and quantity.
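To illustrate what an interleaved, lyrics-first target sequence could look like, the snippet below shows a hypothetical output format and a toy parser that recovers word-note tuples; the tag names and (pitch, onset, duration) triple layout are invented for illustration and are not VocalParse's actual schema.

```python
import re

# Hypothetical target-sequence format: the model first emits the full lyric line as a
# semantic scaffold, then an interleaved stream where each word is followed by its note.
example_target = (
    "<lyrics> 小 星 星 亮 晶 晶 </lyrics> "
    "<score> 小 (C4, 0.00, 0.45) 星 (C4, 0.45, 0.45) 星 (G4, 0.90, 0.45) "
    "亮 (G4, 1.35, 0.45) 晶 (A4, 1.80, 0.45) 晶 (A4, 2.25, 0.45) </score>"
)

def parse_score(seq):
    """Toy parser: recover (word, pitch, onset, duration) tuples from the interleaved part."""
    body = seq.split("<score>")[1].split("</score>")[0]
    pattern = r"(\S+)\s+\(([^,]+),\s*([\d.]+),\s*([\d.]+)\)"
    return [(w, p, float(o), float(d)) for w, p, o, d in re.findall(pattern, body)]

notes = parse_score(example_target)   # [('小', 'C4', 0.0, 0.45), ('星', 'C4', 0.45, 0.45), ...]
```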
The experiments demonstrate VocalParse's state-of-the-art performance across multiple datasets, showcasing its effectiveness in both Automatic Melody Transcription (AMT) and Automatic Lyric Transcription (ALT). The results are robust, with clear metrics provided for evaluation, including Mean Absolute Error (MAE) for melody and Word Error Rate (WER) for lyrics. The ablation studies effectively highlight the importance of the CoT prompting and the SingCrawl pipeline, providing insights into the model's performance drivers.
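As a reminder of how the lyric metric is computed, a minimal word error rate implementation is shown below; for Mandarin lyrics a character-level variant (CER) is common, but the edit-distance computation is the same.

```python
def word_error_rate(reference, hypothesis):
    """WER via Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("twinkle twinkle little star", "twinkle little star"))  # 0.25
```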
The paper provides sufficient implementation details, including training configurations and data processing steps. The availability of source code and checkpoints on GitHub enhances reproducibility, although the lack of a demo or interactive component may limit accessibility for some researchers.
The paper acknowledges limitations such as the assumption of a single global tempo for songs, which may not capture variations in performance. Additionally, the model's performance is constrained by the quality of the teacher pipeline used for data annotation. The focus on Mandarin data may also limit generalizability to other languages without further adaptation.
VocalParse has the potential to significantly impact the field of music information retrieval (MIR) and singing voice synthesis (SVS) by providing a scalable solution for automatic singing voice transcription. This could lead to advancements in music generation, annotation, and analysis, facilitating broader applications in music technology and AI-driven creative processes.