Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-visual architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented: it spans diverse tasks, uses inconsistent taxonomies, and relies on heterogeneous evaluation practices, all of which impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
Primary: University of Toronto
All Institutions: University of Toronto, National University of Singapore, Queen Mary University of London, The Hong Kong University of Science and Technology, The University of Texas at Dallas, University of Oxford, University of Rochester, Microsoft Research
The paper serves as a foundational reference for future research on large-scale Audio-Visual Intelligence, effectively synthesizing existing knowledge and proposing a structured framework for the field. Its comprehensive nature and focus on methodological synthesis position it as a significant contribution to the understanding and advancement of multimodal AI systems.
The paper provides a comprehensive survey of Audio-Visual Intelligence (AVI) within the context of large foundation models, establishing a unified taxonomy that organizes diverse tasks and synthesizes methodological foundations. The authors effectively categorize tasks into understanding, generation, and interaction, and detail the foundational techniques necessary for AVI, including representation learning, cross-modal fusion, and generative modeling. The methodology is well-structured, providing a clear framework for future research.
While the paper is primarily a survey and does not present original experimental results, it does curate existing datasets, benchmarks, and evaluation metrics across various task families. This curation is crucial for establishing a baseline for future work and promoting reproducibility in the field. However, the lack of new experimental results limits the paper's impact in demonstrating the efficacy of the proposed methodologies.
The paper outlines the methodologies and frameworks clearly, which aids in reproducibility. However, as it is a survey, it does not provide specific implementation details or code, which could enhance reproducibility further. The authors mention that they will publicly release summarized resources, which is a positive step toward transparency.
The paper acknowledges the fragmentation in the literature and the inconsistencies in evaluation practices, which complicate systematic comparisons. It also highlights open challenges in synchronization, spatial reasoning, and safety that remain unaddressed in current research. Additionally, the lack of new experimental validation of the proposed frameworks is a notable limitation.
The survey addresses a rapidly evolving field with significant implications for various applications, including assistive technologies, education, robotics, and entertainment. By consolidating knowledge and establishing a coherent framework, the paper can guide future research directions and foster collaboration across subfields. The emphasis on safety and governance also highlights the importance of ethical considerations in the deployment of AVI systems.
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome, Moises Systems, Paradigma
PHALAR introduces a novel contrastive framework for musical audio representation that emphasizes phase and pitch equivariance. This work significantly advances the understanding of musical coherence in audio processing, providing a powerful tool for both academic research and practical applications in music technology.
The methodology presented in PHALAR is innovative, leveraging complex-valued neural networks (CVNNs) and a Learned Spectral Pooling layer to address the challenge of musical coherence in audio representation. The paper effectively shifts the paradigm from traditional pooling methods that enforce invariance to a framework that emphasizes equivariance, allowing for the preservation of temporal alignment critical for music tasks. The use of phase information as a geometric representation is a significant advancement, as it directly addresses limitations in existing models that fail to capture rhythmic and harmonic relationships adequately.
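To make the complex-valued head concrete, the following is a minimal sketch of a generic complex linear projection, written under our own assumptions rather than taken from the PHALAR code; the class name, layer sizes, and interface are illustrative only.

```python
# Sketch of a complex-valued projection head (illustrative, not the PHALAR implementation).
import torch
import torch.nn as nn

class ComplexLinearHead(nn.Module):
    """Complex linear map: (W_r + iW_i)(z_r + iz_i) = (W_r z_r - W_i z_i) + i(W_r z_i + W_i z_r)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w_real = nn.Linear(in_dim, out_dim, bias=False)
        self.w_imag = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, z_real, z_imag):
        out_real = self.w_real(z_real) - self.w_imag(z_imag)
        out_imag = self.w_real(z_imag) + self.w_imag(z_real)
        return out_real, out_imag

# Toy usage: project phase-aware spectrogram features into an embedding space.
head = ComplexLinearHead(in_dim=256, out_dim=128)
z_r, z_i = torch.randn(4, 256), torch.randn(4, 256)
e_r, e_i = head(z_r, z_i)
```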
The experimental evaluation is robust, demonstrating PHALAR's superiority over state-of-the-art models in stem retrieval tasks across multiple datasets (MoisesDB, Slakh, and ChocoChorales). The paper includes a comprehensive set of experiments, including human-centered evaluations that correlate model outputs with human perception of coherence. The results indicate a significant improvement in retrieval accuracy and alignment with human judgment, validating the effectiveness of the proposed approach.
The authors provide a GitHub repository with code and checkpoints, enhancing the reproducibility of their work. They detail the experimental setup, including dataset construction, training parameters, and evaluation metrics, which allows other researchers to replicate their findings. However, the reliance on specific hardware (NVIDIA A100 GPUs) may pose challenges for some researchers in terms of accessibility.
The paper acknowledges several limitations, including the model's performance degradation in the presence of non-periodic rhythms and tempo drift, which can affect phase coherence. Additionally, the model's training on predominantly Western popular music may limit its applicability to other musical styles. The potential for rigid enforcement of rhythmic standards in automated systems is also noted as a concern.
PHALAR has significant implications for music information retrieval and audio processing, offering a novel approach that could enhance various applications, including automated music production and generative audio models. The framework's emphasis on phase-sensitive data representation could extend to other fields such as radar systems and medical imaging, where preserving complex signal integrity is crucial.
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
Primary: University of Texas at Dallas
All Institutions: University of Texas at Dallas, University of Washington, University of California, Los Angeles
The main contribution of this paper is the establishment of a comprehensive benchmark for evaluating physical commonsense in joint audio-video generation models, which addresses a critical gap in the current evaluation landscape. The innovative methodology and thorough experimental evaluation provide valuable insights into the limitations of existing models and pave the way for future advancements in the field.
The paper introduces AV-Phys Bench, a novel benchmark for evaluating physical commonsense in joint audio-video generation models. The methodology is robust, employing a structured evaluation rubric that assesses five dimensions of performance across three scene categories. The use of both human evaluation and an automated evaluator (AV-Phys Agent) that integrates multimodal reasoning with deterministic audio measurement tools is particularly innovative. This dual approach enhances the reliability of the assessments and provides a comprehensive framework for understanding model performance beyond mere perceptual quality.
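As an illustration of what a deterministic acoustic measurement tool in such a ReAct-style loop might look like, the sketch below locates the loudest audio onset so its timestamp can be compared with a visual event; the function is a hypothetical example, not part of AV-Phys Agent.

```python
# Hypothetical acoustic tool for a cross-modal consistency check (not the AV-Phys Agent code).
import numpy as np

def loudest_onset_time(audio: np.ndarray, sr: int, frame: int = 1024, hop: int = 512) -> float:
    """Return the time (in seconds) of the largest frame-to-frame RMS energy increase."""
    n_frames = 1 + max(0, (len(audio) - frame) // hop)
    rms = np.array([
        np.sqrt(np.mean(audio[i * hop: i * hop + frame] ** 2))
        for i in range(n_frames)
    ])
    flux = np.diff(rms, prepend=rms[0])           # energy rise between consecutive frames
    return float(np.argmax(flux) * hop / sr)

# An evaluator could compare this timestamp against the frame index of the
# visual impact reported by the multimodal language model.
```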
The experiments conducted are thorough, evaluating seven models across various categories and dimensions. The results reveal significant gaps in physical commonsense understanding among current models, particularly in transition scenarios. The performance metrics are well-defined, and the analysis is detailed, providing insights into the strengths and weaknesses of the evaluated models. The findings are significant, highlighting the challenges in achieving physical consistency in audio-video generation.
The paper provides sufficient details regarding the evaluation setup, including the datasets and scoring mechanisms. The availability of the dataset and the code repository enhances reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other models not included in the study.
The paper acknowledges limitations such as the focus on English prompts and the binary nature of the evaluation rubric, which may not capture the nuances of model performance. Additionally, the study is constrained to eight-second clips, which may not represent longer or more complex scenarios effectively.
The introduction of AV-Phys Bench has the potential to significantly influence the development of joint audio-video generation models by providing a clear framework for assessing physical commonsense. This could lead to improvements in model architectures and training methodologies, ultimately enhancing the applicability of these models in real-world scenarios, such as virtual environments and educational content.
Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.
Primary: NetEase Cloud Music
All Institutions: NetEase Cloud Music, Individual Researcher
The paper introduces a novel Retrieval-Edit-Rerank framework for chord generation that effectively balances stylistic diversity and music-theoretic feasibility. This work is significant as it provides a structured approach to a complex creative task, advancing the field of music generation by offering a system that is both interpretable and adaptable.
The proposed Retrieval-Edit-Rerank (RER) framework effectively decomposes the chord generation task into three distinct stages, allowing for a clear separation of concerns that enhances both interpretability and control over the generation process. The methodology is well-structured, with a focus on leveraging a melody-chord memory for retrieval, followed by an editing stage that enforces music-theoretic constraints, and a reranking stage that resolves preferences among feasible candidates. This approach is innovative in the context of music generation, as it combines stylistic diversity with theoretical validity in a systematic manner. The use of a contrastive learning framework for memory construction is a notable strength, as it allows for the retrieval of stylistically relevant chord progressions without sacrificing harmonic integrity.
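The decomposition can be summarized with a short pipeline sketch; the stage implementations below are toy stand-ins chosen for illustration and do not reflect the paper's retrieval memory, editing rules, or reranking model.

```python
# Toy sketch of the Retrieval-Edit-Rerank decomposition (stand-in stages, not the paper's system).
from typing import Callable, List, Sequence

def rer_generate(
    melody: Sequence,
    retrieve: Callable[[Sequence], List[list]],       # stage 1: stylistic candidate space
    edit: Callable[[list, Sequence], list],           # stage 2: enforce music-theoretic feasibility
    score: Callable[[list, Sequence], float],         # stage 3: resolve soft preferences
) -> list:
    candidates = retrieve(melody)
    feasible = [edit(chords, melody) for chords in candidates]
    return max(feasible, key=lambda chords: score(chords, melody))

# Toy usage: fixed retrieval, identity editing, reranking by chord variety.
result = rer_generate(
    melody=[60, 62, 64, 65],
    retrieve=lambda m: [["C", "F", "G", "C"], ["Am", "F", "C", "G"]],
    edit=lambda chords, m: chords,
    score=lambda chords, m: len(set(chords)),
)
print(result)  # ['Am', 'F', 'C', 'G']
```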
The experiments are comprehensive, utilizing multiple datasets and a variety of metrics for both objective and subjective evaluation. The inclusion of ablation studies strengthens the findings by demonstrating the importance of each stage in the RER framework. The results show that the proposed method outperforms existing end-to-end models in terms of balancing diversity and feasibility, which is a critical aspect of chord generation. The subjective evaluations involving human participants provide additional validation of the system's effectiveness, indicating a well-rounded experimental design.
The paper provides a clear description of the methodology and experimental setup, which facilitates reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the results fully. Including a GitHub repository or links to datasets would significantly enhance the reproducibility of the work.
One limitation is the reliance on a fixed set of music-theoretic constraints, which may not capture the full range of stylistic diversity present in various musical genres. Additionally, the system's performance may vary depending on the quality and diversity of the melody-chord memory constructed during training. The paper also notes that the editing stage can sometimes lead to overly conservative outputs, which may limit creative exploration.
The RER framework has the potential to significantly impact music generation applications, particularly in contexts where adherence to music theory is essential, such as in music education, composition tools, and automated music production systems. By providing a controllable and interpretable approach to chord generation, this work could facilitate more nuanced interactions between musicians and AI systems, enhancing creativity while respecting musical traditions.
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
Primary: Hong Kong University of Science and Technology (HKUST)
All Institutions: Hong Kong University of Science and Technology (HKUST), Stanford University
The main contribution of this paper is the introduction of AnisoAlign, a structured geometric correction framework that addresses the modality gap in multimodal learning, allowing for effective training of multimodal models using unimodal data. This work significantly advances the understanding of modality alignment and provides a robust methodology that can be applied in various multimodal applications.
The proposed methodology, AnisoAlign, presents a novel approach to addressing the modality gap in multimodal learning by focusing on the geometric structure of modality representations. The authors effectively identify that the modality gap is not merely a centroid shift but an anisotropic residual, which is a significant insight. The method involves a two-stage process that includes a target-modality prior pretraining and a bounded refinement step to ensure that the source modality's semantic structure is preserved while aligning with the target modality. This structured approach is well-justified and theoretically supported, making it a strong contribution to the field.
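A rough sketch of the underlying idea, under our own simplifying assumptions and not the AnisoAlign algorithm itself, is to correct source embeddings only along the target modality's dominant principal directions with a bounded step:

```python
# Illustrative bounded, anisotropic correction of source embeddings (not the AnisoAlign code).
import numpy as np

def bounded_anisotropic_correction(src, tgt, k=8, max_step=0.5):
    """src: (n, d) source-modality embeddings; tgt: (m, d) target-modality embeddings."""
    tgt_mean = tgt.mean(axis=0)
    _, _, vt = np.linalg.svd(tgt - tgt_mean, full_matrices=False)
    basis = vt[:k]                                   # dominant directions of the target geometry
    gap = tgt_mean - src.mean(axis=0)                # global residual between modalities
    gap_dominant = basis.T @ (basis @ gap)           # keep only the anisotropic, dominant part
    norm = np.linalg.norm(gap_dominant) + 1e-8
    step = min(norm, max_step)                       # bounded correction magnitude
    return src + step * (gap_dominant / norm)        # shift preserves within-source structure

# Usage with random stand-in embeddings.
rng = np.random.default_rng(0)
corrected = bounded_anisotropic_correction(rng.normal(size=(100, 64)), rng.normal(size=(200, 64)))
```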
The experiments are comprehensive, evaluating both geometric diagnostics and the performance of the model in multimodal large language model (MLLM) training. The results demonstrate that AnisoAlign outperforms existing methods in terms of both representation alignment and MLLM training effectiveness. The use of various metrics to assess performance adds rigor to the evaluation, although the paper could benefit from more extensive ablation studies to further clarify the contributions of individual components.
The paper provides a detailed description of the methodology and experimental setup, which aids in reproducibility. However, the lack of a public code repository or demo limits the ability for external validation of results. Future work should consider making the implementation available to enhance reproducibility.
One limitation is the reliance on high-quality unimodal data, which may not always be available. Additionally, while the paper discusses the geometric aspects of modality alignment, it does not fully explore the implications of varying data quality on the effectiveness of the proposed method.
The findings have significant implications for the development of multimodal models, particularly in scenarios where paired data is scarce. By enabling the use of unimodal data for training, this work could facilitate advancements in applications such as image captioning, visual question answering, and other areas where multimodal understanding is crucial.
Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
Primary: Heriot-Watt University Dubai
All Institutions: Heriot-Watt University Dubai
The paper presents a significant advancement in unsupervised learning for bioacoustic state discovery, demonstrating the ability to extract structured acoustic patterns from honey bee buzzing without prior assumptions or annotations. The methodology is innovative and the results impactful, contributing to both machine learning and ecological monitoring fields.
The paper introduces BeeVe, a novel unsupervised framework for acoustic state discovery in honey bee buzzing, leveraging a self-supervised Patchout Spectrogram Transformer (PaSST) as a feature extractor and a Vector-Quantized Variational Autoencoder (VQ-VAE) for learning a discrete codebook of acoustic tokens. The methodology is well-structured, employing a rigorous unsupervised learning approach without relying on predefined labels or semantic assumptions. The use of post-hoc evaluation against known queen status to validate the learned tokens adds robustness to the methodology. However, the choice of PaSST as a frozen feature extractor, while justified, may limit the model's adaptability to other non-vocal species.
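The codebook-learning step can be illustrated with a generic vector-quantization module; this is an assumed, textbook-style VQ-VAE quantizer, not the BeeVe implementation, and the codebook size is arbitrary.

```python
# Generic VQ-VAE quantizer over frozen audio embeddings (illustrative, not the BeeVe code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_tokens: int = 64, dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, dim)
        self.beta = beta

    def forward(self, z):                                 # z: (batch, dim) frozen PaSST-style features
        distances = torch.cdist(z, self.codebook.weight)  # distance to every codebook vector
        idx = distances.argmin(dim=1)                     # discrete acoustic tokens
        z_q = self.codebook(idx)
        loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()                      # straight-through gradient estimator
        return z_q, idx, loss
```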
The experiments are comprehensive, utilizing a dataset of honey bee audio to assess the effectiveness of the proposed method. The results demonstrate significant separation between queenright and queenless conditions, with Jensen-Shannon Divergence values indicating meaningful distinctions. The identification of stable sub-states within the queenless condition and the analysis of token transition patterns provide strong evidence of the model's capability to uncover structured acoustic states. The metrics used for evaluation, including Jaccard overlap and manifold projection, are appropriate and effectively illustrate the model's performance.
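For the post-hoc comparison, a plausible computation, shown here only as an assumption about the analysis, is the Jensen-Shannon divergence between token histograms under the two queen conditions:

```python
# Assumed form of the token-distribution comparison (not the authors' exact analysis code).
import numpy as np
from scipy.spatial.distance import jensenshannon

def token_histogram(tokens, num_tokens):
    counts = np.bincount(tokens, minlength=num_tokens).astype(float)
    return counts / counts.sum()

def js_divergence(tokens_a, tokens_b, num_tokens=64):
    p, q = token_histogram(tokens_a, num_tokens), token_histogram(tokens_b, num_tokens)
    # scipy returns the JS *distance*; squaring gives the divergence.
    return jensenshannon(p, q, base=2) ** 2

queenright = np.random.default_rng(0).integers(0, 64, size=5000)
queenless = np.random.default_rng(1).integers(0, 64, size=5000)
print(js_divergence(queenright, queenless))
```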
The paper provides detailed implementation details, including the architecture of the VQ-VAE, training objectives, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for other researchers to replicate the findings directly.
The study is limited by its reliance on a controlled subset of the UrBAN dataset, which may not fully capture the diversity of acoustic states across different hives and conditions. Additionally, while the findings are promising, the lack of biological annotation to ground the discovered states raises questions about their true biological relevance. The scalability of the approach to larger datasets and more complex hive conditions remains to be validated.
The implications of this work extend to non-invasive monitoring of honey bee colonies, potentially aiding in the early detection of conditions such as queen loss or swarming. The unsupervised nature of the framework allows for the identification of previously unlabelled states, which could enhance hive management practices and contribute to pollinator conservation efforts. The approach also opens avenues for future research in bioacoustics and machine learning applications in non-vocal species.
The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Such performance estimates depend on the classifier and operating point, and therefore provide an incomplete or even misleading characterisation of privacy risk. We investigate the use of similarity rank disclosure (SRD), an information-theoretic metric, which operates on feature representations rather than classifier decisions, providing a threshold-independent assessment of privacy and analysis of both average and worst-case disclosure. We report its application to speaker embeddings, fundamental frequency, and phone embeddings using 2024 VoicePrivacy Challenge systems. The SRD reveals privacy leaks and system-specific weaknesses missed by EER-based evaluation. Findings highlight the merit of representation-level metrics and demonstrate the potential of SRD as a flexible and interpretable tool for the evaluation of voice anonymisation.
Primary: EURECOM
All Institutions: EURECOM, Ruhr-Universität Bochum, Orange Innovation, University of Stuttgart
The main contribution of this paper is the introduction of the Similarity Rank Disclosure (SRD) metric for evaluating voice anonymisation, which provides a more interpretable and comprehensive assessment of privacy risks compared to traditional metrics. The technical contribution is significant as it addresses critical gaps in existing evaluation practices, offering a robust framework for future research and application in voice privacy.
The paper introduces the Similarity Rank Disclosure (SRD) as a novel metric for evaluating voice anonymisation, which operates independently of classifier decisions and provides a more nuanced understanding of privacy risks. The methodology is well-structured, detailing the steps for computing SRD, including ranking, distribution generation, and statistical modeling. The use of empirical probability distributions and beta-binomial fitting enhances the robustness of the evaluation. However, the paper could benefit from clearer explanations of the statistical methods used and their implications for the results.
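The ranking step can be illustrated as follows; this sketch covers only the first stage of the metric under our own assumptions (cosine similarity over embeddings) and not the full SRD pipeline with beta-binomial modeling.

```python
# Illustrative similarity-ranking step (an assumption, not the paper's SRD implementation).
import numpy as np

def similarity_rank(probe, enrolled, true_index):
    """probe: (d,) anonymised-speech embedding; enrolled: (n, d); returns 1-based rank of the true speaker."""
    probe = probe / np.linalg.norm(probe)
    enrolled = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    order = np.argsort(-(enrolled @ probe))          # most similar enrolled speaker first
    return int(np.where(order == true_index)[0][0]) + 1

# Collecting such ranks over many trials yields the empirical rank distribution
# that SRD summarises, e.g. via a beta-binomial fit, for average and worst-case disclosure.
```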
The experiments leverage a comprehensive dataset from the 2024 VoicePrivacy Challenge, applying the SRD to various anonymisation systems. The results demonstrate that SRD can reveal privacy leaks and weaknesses that traditional metrics like EER miss. The evaluation includes both qualitative and quantitative analyses, providing a thorough comparison of different anonymisation approaches. However, the paper does not provide extensive details on the experimental setup, such as the specific configurations of the anonymisation systems or the exact nature of the datasets used.
The paper lacks sufficient details for full reproducibility. While it describes the methodology and provides some results, it does not include code or data availability, which are critical for other researchers to replicate the findings. Clearer documentation of the experimental setup and access to the datasets used would enhance reproducibility.
One limitation is the reliance on a specific dataset (2024 VoicePrivacy Challenge) which may not generalize to other contexts or datasets. Additionally, the SRD's effectiveness in various real-world scenarios remains to be fully validated. The paper also acknowledges the potential for overestimation of privacy if strong attack models are not used, which is a critical consideration for future work.
The findings have significant implications for the development of privacy-preserving technologies in voice processing, particularly in light of increasing concerns about data privacy and regulation. The SRD could serve as a foundational tool for evaluating voice anonymisation systems, influencing both academic research and industry practices. The flexibility of the SRD to adapt to various feature representations also opens avenues for future research in related domains.
Closed-Set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.
Primary: University Mohammed VI Polytechnic
All Institutions: University Mohammed VI Polytechnic, CID Development
The paper presents TARNet, a novel multi-scale architecture for closed-set speaker identification that effectively models temporal dependencies, achieving state-of-the-art performance while maintaining computational efficiency. The comprehensive evaluation of the methodology, experimental results, and potential applications underscores its significance in the field of audio processing and speaker recognition.
The proposed TARNet architecture introduces a multi-scale temporal encoder that effectively captures speaker-specific characteristics across different temporal scales. The use of dilated convolutions allows for the modeling of temporal dependencies while preserving resolution, which is a significant improvement over traditional CNN architectures. The Attentive Statistics Pooling (ASP) module further enhances the model's ability to focus on discriminative features, making the methodology both innovative and practical for real-world applications.
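The pooling stage can be sketched with a standard Attentive Statistics Pooling layer; the code below is a generic formulation under assumed dimensions, not the TARNet source.

```python
# Generic Attentive Statistics Pooling layer (illustrative, not the TARNet code).
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(dim, bottleneck, kernel_size=1), nn.Tanh(),
            nn.Conv1d(bottleneck, dim, kernel_size=1),
        )

    def forward(self, x):                        # x: (batch, dim, frames)
        w = torch.softmax(self.attn(x), dim=2)   # per-frame attention weights
        mu = (w * x).sum(dim=2)                  # weighted mean
        var = (w * x ** 2).sum(dim=2) - mu ** 2  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-6))
        return torch.cat([mu, std], dim=1)       # utterance-level embedding: (batch, 2 * dim)
```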
The experiments conducted on VoxCeleb1 and LibriSpeech datasets demonstrate TARNet's superior performance compared to state-of-the-art models. The results are well-presented, showing a clear advantage in accuracy metrics. The paper also includes ablation studies that validate the importance of each component in the architecture, providing a comprehensive evaluation of the model's effectiveness.
The authors have made the code publicly available, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and training procedures, to facilitate easier replication of results by other researchers.
One limitation of the study is the lack of evaluation in noisy or reverberant conditions, which are common in real-world scenarios. Additionally, while TARNet shows strong performance on the evaluated datasets, its generalizability to other speaker identification tasks or languages remains untested.
The advancements presented in TARNet have significant implications for biometric authentication and forensic analysis, where accurate speaker identification is crucial. The lightweight nature of the model also suggests potential applications in mobile and embedded systems, expanding its usability in various domains.
Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities between countries aligned with geographic and linguistic relationships, validating our approach. Substantial variation emerged in both melodic and rhythmic structures across countries, yet the diversity of the two components was not significantly correlated, challenging assumptions of coupled evolution. Only rhythmic diversity was significantly associated with ethnic and linguistic heterogeneity, while melodic diversity showed no such association. These findings suggest that melody and rhythm constitute partially independent systems shaped by distinct cultural and evolutionary pressures, rather than components of a single monolithic musical style.
Primary: University of Cambridge
All Institutions: University of Cambridge, RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion, University of Oslo, Department of Psychology, Goldsmiths College, University of London, Department of Life Sciences, Leipzig University, Division of Social Science, New York University Abu Dhabi, Department of Psychology, Cornell University
This paper presents a significant advancement in understanding the independent evolution of melody and rhythm across cultures through a novel computational approach. The methodology is innovative and the findings challenge existing assumptions in music theory, providing a fresh perspective on cultural music analysis.
The paper introduces a novel computational pipeline that leverages deep learning source separation techniques to extract melodic and rhythmic features from a large dataset of songs. This approach is innovative as it allows for the analysis of music without relying on traditional, often biased, manual annotations. The methodology is well-detailed, including the use of kernel density estimation for summarizing melodic and rhythmic distributions, and the careful consideration of time scales for analyzing pitch intervals. The choice of using distributional profiles rather than higher-level constructs is a significant strength, as it minimizes analytical biases. The operational definitions of melody and rhythm are clear, although they are somewhat limited in scope.
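An assumed, simplified version of the melodic summary step is sketched below: a kernel density estimate over consecutive pitch intervals, which is one plausible reading of the described distributional profile rather than the study's released code.

```python
# Assumed sketch of a melodic pitch-interval profile (not the study's released pipeline).
import numpy as np
from scipy.stats import gaussian_kde

def pitch_interval_profile(f0_hz, grid=np.linspace(-12, 12, 241)):
    """f0_hz: voiced fundamental-frequency track (Hz); returns a normalised density over the grid."""
    semitones = 12 * np.log2(np.asarray(f0_hz) / 440.0)
    intervals = np.diff(semitones)               # consecutive melodic intervals in semitones
    density = gaussian_kde(intervals)(grid)
    return density / density.sum()

# Country-level profiles aggregated from such per-song densities can then be
# compared with Jensen-Shannon divergence, as done for the cross-cultural analysis.
```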
The experimental design is robust, utilizing a large dataset of 27,628 songs from 59 countries, which provides a comprehensive basis for cross-cultural analysis. The authors validate their computational pipeline by demonstrating that the extracted distributions align with known musical patterns, thus establishing face validity. The use of Jensen-Shannon divergence to assess musical similarity between countries is appropriate and effectively highlights the independence of melodic and rhythmic diversity. However, the paper could benefit from additional metrics or qualitative assessments to further substantiate the findings.
The paper provides sufficient detail regarding the methods and algorithms used, including the specific tools and parameters for source separation and feature extraction. The availability of the code and metadata through the provided GitHub link enhances reproducibility. However, the reliance on proprietary audio data from YouTube may limit the ability of others to fully replicate the study, particularly in regions with less representation.
The authors acknowledge several limitations, including the potential biases introduced by using YouTube chart data, which may not capture traditional or non-commercial music. Additionally, the source separation algorithms are primarily trained on Western music, which could affect the accuracy of the extracted features for non-Western genres. The operational definitions of melody and rhythm are also somewhat narrow, potentially overlooking the complexity of musical interactions.
The findings have significant implications for the fields of music cognition and cultural evolution, suggesting that melody and rhythm are shaped by different cultural and evolutionary pressures. This research could influence how music is studied across disciplines, including anthropology, psychology, and musicology. The methodology could also be applied to other forms of cultural expression, providing insights into the interplay between different artistic components.
Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which approaches MER from the perspective of representation consistency, aiming to enable robust emotion prediction across heterogeneous modality combinations. MCUR incorporates two core components: (1) Modality Combination-Based and Category-Based Contrastive Learning mechanism (MCB-CL), which encourages samples with the same emotion category and the same available modalities to be close in the representation space; and (2) Sample-wise Uncertainty-Guided Regularization (SUGR), which adaptively assigns uncertainty-based weights to individual samples to optimize training. Extensive experiments demonstrate that MCUR consistently outperforms existing methods, achieving average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.
Primary: University of Electronic Science and Technology of China
All Institutions: University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China
The paper introduces the MCUR framework, which enhances multimodal emotion recognition by promoting representation consistency and addressing uncertainty in predictions. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and human-computer interaction.
The proposed MCUR framework presents a novel approach to multimodal emotion recognition (MER) by focusing on representation consistency across heterogeneous modalities. The integration of Modality Combination-Based and Category-Based Contrastive Learning (MCB-CL) and Sample-wise Uncertainty-Guided Regularization (SUGR) is a significant methodological advancement. MCB-CL enhances the discriminative power of representations by enforcing proximity in the embedding space for samples with the same emotion category and modality combination, while SUGR addresses uncertainty in predictions, allowing for adaptive weighting during training. This dual approach is innovative and effectively tackles the challenges posed by modality heterogeneity.
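The positive-pair definition can be illustrated with a supervised contrastive loss in which two samples are positives only when both the emotion label and the modality combination match; this is our own schematic reading, not the MCB-CL implementation.

```python
# Schematic contrastive loss with combined emotion/modality-combination positives (not the MCUR code).
import torch
import torch.nn.functional as F

def combo_contrastive_loss(z, emotion, modality_combo, temperature=0.1):
    """z: (n, d) embeddings; emotion, modality_combo: (n,) integer labels."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    positives = (emotion[:, None] == emotion[None, :]) & \
                (modality_combo[:, None] == modality_combo[None, :])
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    positives = positives & ~self_mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    per_anchor = -(log_prob * positives).sum(dim=1) / positives.sum(dim=1).clamp(min=1)
    return per_anchor[positives.any(dim=1)].mean()
```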
The experiments are comprehensive, utilizing three widely recognized datasets (MOSI, MOSEI, and IEMOCAP) to validate the effectiveness of the MCUR framework. The reported performance improvements over existing state-of-the-art methods, with average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP, demonstrate the robustness of the proposed approach. The ablation studies further substantiate the contributions of each component of the framework, revealing the critical role of both MCB-CL and SUGR in enhancing model performance.
The paper provides detailed implementation details, including the training configurations, hyperparameter settings, and evaluation protocols, which are crucial for reproducibility. The authors also mention the use of official implementations for baseline models, ensuring a fair comparison. However, the lack of publicly available code or demo URLs limits the ease of reproduction for external researchers.
While the MCUR framework shows promising results, the paper does not address the potential computational overhead associated with the added complexity of the proposed methods. Additionally, the performance in real-world noisy conditions is not thoroughly evaluated, which could limit the applicability of the framework in practical scenarios. The reliance on specific datasets may also restrict generalizability to other contexts or domains.
The advancements presented in this paper have significant implications for human-computer interaction, particularly in enhancing emotion recognition systems that can adapt to varying modalities. The ability to maintain consistent representations across different modalities can improve the robustness of applications in areas such as virtual assistants, mental health monitoring, and social robotics. The focus on uncertainty in predictions may also lead to more reliable systems that can better handle real-world variability.
Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural directional filtering and diffuse sound extraction. NDF+ reformulates VDM estimation into two coupled subtasks: dereverberated VDM reconstruction and diffuse sound extraction. This reformulation enables NDF+ to manipulate diffuse components in the final reconstructed VDM output. We evaluated NDF+ under reverberant conditions and compared it with representative conventional baselines. Results show that NDF+ consistently outperforms the baselines on both subtasks, while maintaining VDM reconstruction quality comparable to that of the original single-task NDF model. These findings indicate that NDF+ introduces an additional degree of freedom for diffuse sound control in the VDM reconstruction. In a stereo recording application, NDF+ provides controllable inter-channel level differences between left and right channels by adjusting the estimated diffuse component.
Primary: International Audio Laboratories Erlangen
All Institutions: International Audio Laboratories Erlangen, Fraunhofer IIS, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
The main contribution of this paper is the introduction of NDF+, a joint framework for neural directional filtering and diffuse sound extraction that enhances VDM reconstruction while allowing for effective control of diffuse sound components. This work represents a significant step forward in spatial audio processing, combining innovative methodologies with rigorous experimental validation to address key challenges in the field.
The paper introduces NDF+, a novel framework that combines neural directional filtering with diffuse sound extraction, effectively reformulating the VDM estimation into two coupled subtasks. The methodology employs a dual-mask architecture using LSTM networks to estimate coherent and diffuse components, which is a significant advancement over previous models that focused solely on VDM reconstruction. The approach is well-structured, with a clear explanation of the DNN architecture, training strategy, and loss functions, demonstrating a thoughtful integration of existing techniques with innovative modifications.
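An assumed skeleton of the dual-mask idea is given below: an LSTM predicting one mask for the dereverberated directional signal and one for the diffuse component; the layer sizes and mixing rule are illustrative, not the NDF+ architecture.

```python
# Illustrative dual-mask LSTM for coherent/diffuse estimation (not the NDF+ model).
import torch
import torch.nn as nn

class DualMaskLSTM(nn.Module):
    def __init__(self, feat_dim: int = 257, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.to_masks = nn.Linear(hidden, 2 * feat_dim)

    def forward(self, feats):                     # feats: (batch, frames, feat_dim)
        h, _ = self.lstm(feats)
        masks = torch.sigmoid(self.to_masks(h))
        m_coherent, m_diffuse = masks.chunk(2, dim=-1)
        return m_coherent, m_diffuse

# The final VDM output could then mix the two components, e.g.
#   vdm = m_coherent * ref_spec + alpha * m_diffuse * ref_spec,
# where alpha exposes the extra degree of freedom over diffuse sound.
```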
The experimental evaluation is comprehensive, comparing NDF+ against conventional baselines under various reverberant conditions. The results indicate that NDF+ consistently outperforms these baselines on both subtasks while maintaining VDM reconstruction quality. The use of objective metrics such as SDR and PESQ to measure performance adds rigor to the evaluation. However, the paper could benefit from more detailed qualitative assessments, such as user studies or subjective listening tests, to further validate the improvements in audio quality.
The paper provides a detailed description of the experimental setup, including the configurations of the microphone array, training data, and evaluation metrics. However, the absence of a public code repository or demo URL limits the reproducibility of the results. Including such resources would enhance the paper's impact and allow other researchers to validate and build upon the findings.
One limitation is the reliance on simulated environments for training and testing, which may not fully capture the complexities of real-world acoustic scenarios. Additionally, while the paper discusses the performance of NDF+ in stereo recording applications, it does not explore its scalability to larger microphone arrays or more complex sound environments.
The advancements presented in NDF+ have significant implications for spatial audio applications, particularly in enhancing the quality of recordings in reverberant environments. The ability to control diffuse sound components can improve immersive audio experiences in various fields, including virtual reality, telecommunications, and music production. The framework could also inspire further research into joint signal processing techniques in audio applications.
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $\varepsilon = 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.
Primary: Sogang University
All Institutions: Sogang University
The paper presents a significant advancement in audio evaluation metrics by introducing OTAD, which effectively addresses the limitations of existing methods through innovative methodological contributions and rigorous empirical validation.
The proposed methodology introduces a novel Optimal Transport Audio Distance (OTAD) metric that addresses the limitations of existing metrics like Fréchet Audio Distance (FAD) by employing a dual correction mechanism: a learned Riemannian ground-metric adapter for the cost function and entropic Sinkhorn optimal transport for the coupling. This innovative approach allows for a more sensitive detection of artifacts in audio generation, which is crucial for applications in text-to-audio synthesis. The method is theoretically grounded and systematically validated through a comprehensive experimental design, including a factorial decomposition of the contributions from cost and coupling.
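The coupling primitive can be illustrated with a generic log-domain Sinkhorn solver between two embedding clouds; this is a textbook entropic OT sketch, not the OTAD toolkit, and the cost and epsilon choices are assumptions (epsilon should be scaled to the cost magnitude).

```python
# Generic entropic Sinkhorn transport between embedding sets (illustrative, not the OTAD toolkit).
import numpy as np
from scipy.special import logsumexp

def sinkhorn_distance(x, y, eps=0.05, n_iter=200):
    """x: (n, d), y: (m, d); returns <P, C> and the plan P (rows give per-sample diagnostics)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean ground cost
    log_a = np.full(len(x), -np.log(len(x)))                # uniform source weights
    log_b = np.full(len(y), -np.log(len(y)))                # uniform target weights
    f, g = np.zeros(len(x)), np.zeros(len(y))
    for _ in range(n_iter):                                 # log-domain Sinkhorn iterations
        f = eps * (log_a - logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - cost) / eps, axis=0))
    plan = np.exp((f[:, None] + g[None, :] - cost) / eps)
    return float((plan * cost).sum()), plan

rng = np.random.default_rng(0)
dist, plan = sinkhorn_distance(rng.normal(size=(64, 16)), rng.normal(size=(64, 16)), eps=1.0)
```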
The experiments are robust, utilizing eight different encoders and a four-axis evaluation protocol to assess the performance of OTAD against FAD and KAD. The results indicate a significant improvement in sensitivity to rank-1 contamination and a higher correlation with human Mean Opinion Scores (MOS). The experiments also include per-sample diagnostics, which provide insights into the specific artifacts present in audio samples, highlighting the practical utility of OTAD in real-world applications.
The paper includes sufficient detail regarding the implementation of the OTAD metric and the experimental setup, including the datasets used (FSD50K and ESC-50) and the training of the adapters. The release of the OTAD toolkit on GitHub further enhances reproducibility, allowing other researchers to replicate the findings and utilize the metric in their own work.
The study acknowledges several limitations, including the reliance on a single listening test for MOS validation and the potential biases introduced by training on a specific dataset (FSD50K). Additionally, the performance of OTAD on music and speech domains remains untested, and the scalability of the method for larger datasets is not fully explored.
The introduction of OTAD has significant implications for the field of audio generation evaluation, providing a more nuanced and sensitive metric that can improve the quality of generated audio. This advancement could lead to better user experiences in applications such as music synthesis, sound design, and audio restoration. The methodology can also serve as a blueprint for future research in audio evaluation metrics across different domains.
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.
Primary: Skolkovo Institute of Science and Technology
All Institutions: Skolkovo Institute of Science and Technology
The main contribution of this paper is the introduction of the PianoCoRe dataset, a refined and comprehensive MIDI dataset that addresses the limitations of existing resources in symbolic music analysis. This work significantly enhances the foundation for future research in expressive piano performance modeling and music information retrieval, showcasing a meticulous approach to dataset curation and quality assessment.
The methodology presented in this paper is robust and comprehensive, detailing a systematic approach to curating and refining a large-scale piano MIDI dataset. The authors employ a multi-tiered strategy that includes deduplication, quality assessment, and note alignment refinement using the RAScoP pipeline. The integration of various existing datasets into a unified collection is particularly noteworthy, as it addresses the inconsistencies and limitations found in previous datasets. The use of a MIDI quality classifier to filter out corrupted transcriptions and the detailed description of the alignment process further enhance the methodological rigor.
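As a rough illustration of the kind of signal a MIDI quality classifier might exploit, score-like (quantized) transcriptions tend to have near-constant velocities and onsets snapped to a metrical grid, whereas human performances vary continuously. The sketch below is hypothetical and based on the pretty_midi library; the feature choices and the 120 bpm grid assumption are not taken from the paper.

```python
# Hedged sketch: crude features that could separate score-like (quantized) MIDI
# from expressive performances. The feature choice is an assumption for
# illustration, not the paper's actual MIDI quality classifier.
import numpy as np
import pretty_midi

def quantization_features(path):
    pm = pretty_midi.PrettyMIDI(path)
    notes = [n for inst in pm.instruments for n in inst.notes]
    if not notes:
        return None
    onsets = np.array([n.start for n in notes])
    vels = np.array([n.velocity for n in notes])
    # Distance of onsets to the nearest 1/16-note grid position (assumes 120 bpm).
    grid = 0.125
    grid_dev = np.abs(((onsets + grid / 2) % grid) - grid / 2)
    return {
        "velocity_std": float(vels.std()),              # ~0 for flat, score-like files
        "mean_grid_deviation": float(grid_dev.mean()),  # ~0 if fully quantized
        "n_notes": len(notes),
    }
```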
The experiments conducted demonstrate the effectiveness of the proposed dataset and methodologies. The authors provide a thorough evaluation of the MIDI quality classifier, achieving a high macro F1 score, which indicates the classifier's reliability in distinguishing between performance qualities. Additionally, the application of the dataset in training an expressive performance rendering model shows significant improvements in robustness, suggesting that the dataset effectively supports advanced modeling tasks. However, specific quantitative results from the expressive performance rendering model could further strengthen the experimental validation.
The paper includes detailed descriptions of the dataset construction process, including data sources, matching methodologies, and quality assessment techniques. The authors provide a GitHub repository link for the project, which enhances reproducibility. However, the paper could benefit from including specific implementation details or code snippets to facilitate replication of the methodologies by other researchers.
One limitation identified is the reliance on existing datasets, which may still contain inherent biases or limitations that could affect the quality of the combined dataset. Additionally, while the RAScoP pipeline improves alignment, the paper does not fully address potential edge cases where alignment might still be problematic. The focus on public domain works may also limit the dataset's applicability to contemporary compositions.
The PianoCoRe dataset has the potential to significantly impact the field of music information retrieval and computational musicology by providing a comprehensive resource for training models in expressive performance rendering and analysis. Its tiered structure allows for diverse applications, from large-scale analysis to specific performance modeling tasks, thus fostering advancements in music generation and understanding.
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.
Primary: MERL
All Institutions: MERL
The paper presents a novel framework that bridges predictive and generative modeling for speech enhancement and separation, demonstrating significant improvements in perceptual quality while maintaining competitive performance on traditional metrics. The comprehensive methodology and robust experimental validation position this work as a meaningful contribution to the field of machine learning in audio processing.
The proposed Stochastic Interpolant Prior for Speech (SIPS) framework effectively integrates predictive and generative modeling approaches, addressing the limitations of both paradigms by introducing a mathematically grounded decomposition of interpolation dynamics. This innovative methodology allows for a flexible and efficient plug-and-play integration with existing predictors, enhancing the perceptual quality of speech enhancement and separation tasks while maintaining fidelity to the original signals.
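The plug-and-play sampling idea can be sketched under strong simplifications: a predictive estimate supplies a deterministic drift that pulls samples toward a task-consistent target, while a score model trained only on clean speech supplies the denoising correction. The interpolant schedule, step rule, and placeholder networks below are assumptions and do not reproduce the authors' exact drift decomposition.

```python
# Schematic sketch of sampling with a predictive drift plus a clean-speech score
# prior. The schedule, step rule, and placeholder networks below are simplifying
# assumptions, not the paper's exact formulation.
import numpy as np

def predictor_estimate(y):
    # Placeholder for a pretrained enhancement/separation model's output.
    return 0.9 * y

def score_model(x, t):
    # Placeholder for a score network trained only on clean speech.
    return -x  # score of a standard Gaussian, for illustration

def sample(y, n_steps=50, sigma=0.5):
    rng = np.random.default_rng(0)
    x_hat = predictor_estimate(y)            # task-specific anchor
    x = rng.normal(size=y.shape)             # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        drift = (x_hat - x) / max(1.0 - t, dt)              # pull toward the predictor output
        x = x + dt * (drift + sigma * score_model(x, t))    # score keeps samples speech-like
        x = x + np.sqrt(dt) * sigma * rng.normal(size=x.shape)  # stochastic denoising component
    return x

noisy = np.random.default_rng(1).normal(size=(16000,))
enhanced = sample(noisy)
```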
The experiments conducted demonstrate the efficacy of SIPS across various tasks, including speech enhancement and separation, using multiple state-of-the-art predictors. The results indicate consistent improvements in non-intrusive perceptual quality metrics, alongside competitive performance in reference-based metrics, showcasing the robustness and versatility of the proposed method.
The paper provides a clear implementation of the proposed method, including detailed descriptions of the experimental setup, data representation, and training procedures. The availability of the implementation on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings.
One limitation is the reliance on clean speech data for training the generative prior, which may affect performance in real-world scenarios with diverse degradation types. Additionally, while the method shows promise, further exploration of its generalization capabilities across different audio domains is warranted.
The SIPS framework has significant implications for various applications in speech processing, including telecommunications, assistive technologies, and audio content creation. By improving speech quality in challenging conditions, this work can enhance user experiences in voice communication systems and contribute to advancements in automatic speech recognition and natural language processing.
Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for a deployment-critical query family. We study answer-preserving audio compression, judging a compressor by the excess answer-error it induces, especially for the worst-affected family. We formulate this theoretically as a compressor acceptance-rejection criterion, derive a practical sign-off protocol that returns compression budgets satisfying worst-family checks with statistical confidence, and evaluate it on five multiple-choice audio question-answering benchmarks with two Qwen-based backbones. The protocol exposes hidden family-level damage, shows that the chosen query-family partition can change the approved budget, and identifies regimes where query-conditioned compression helps maintain answer preservation.
Primary: Technion--Israel Institute of Technology
All Institutions: Technion--Israel Institute of Technology
The main contribution of this paper is the introduction of a framework for task-aware answer-preserving audio compression, which addresses the critical challenge of maintaining answer quality in large audio language models under compression constraints. This work significantly advances the understanding of audio compression impacts on model performance and provides a practical methodology for evaluating and ensuring answer preservation across diverse query families.
The methodology presented in this paper is robust and well-structured. The authors introduce a theoretical framework for task-aware answer-preserving audio compression, which is a novel approach to evaluating audio compression techniques in the context of large audio language models (LALMs). The paper formulates a compressor acceptance-rejection criterion and derives a practical sign-off protocol that incorporates statistical confidence, which is a significant contribution to the field. The approach is grounded in a solid theoretical foundation, linking practical deployment configurations to answer preservation metrics. The use of paired evaluations and the focus on worst-family checks are particularly noteworthy, as they address the critical issue of performance degradation across different query families.
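The worst-family check at the heart of such a sign-off protocol can be sketched as follows: for each query family, estimate the excess answer-error a compression budget induces relative to full audio and require its upper confidence bound to stay below a tolerance before the budget is approved. The paired Hoeffding-style bound and the tolerance value below are illustrative assumptions rather than the paper's exact statistics.

```python
# Hedged sketch: approve a compression budget only if, for every query family,
# the upper confidence bound on the excess answer-error stays below a tolerance.
# The Hoeffding-style bound and the tolerance are illustrative assumptions.
import numpy as np

def excess_error_ucb(correct_full, correct_comp, delta=0.05):
    """Paired 0/1 correctness arrays for full-audio vs. compressed-audio answers."""
    diff = correct_full.astype(float) - correct_comp.astype(float)  # per-query excess error
    n = len(diff)
    half_width = np.sqrt(np.log(1.0 / delta) / (2.0 * n))           # Hoeffding-style bound
    return diff.mean() + half_width

def approve_budget(per_family_results, tol=0.02):
    """per_family_results: dict family -> (correct_full, correct_comp)."""
    worst = max(excess_error_ucb(f, c) for f, c in per_family_results.values())
    return worst <= tol, worst
```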
The experimental evaluation is comprehensive, utilizing five multiple-choice audio question-answering benchmarks. The authors effectively demonstrate the applicability of their framework and the importance of considering family-level performance rather than relying solely on average metrics. The results reveal significant insights into how compression can affect different query families, showcasing the hidden damage that can occur when using average performance metrics. The experiments are well-designed, and the analysis is thorough, providing empirical support for the theoretical claims made in the paper.
The paper provides detailed descriptions of the experimental setup, including the datasets, models, and evaluation metrics used. However, there are some limitations regarding the availability of code and data, as no URLs for project repositories or demo pages are provided. This lack of resources may hinder reproducibility for other researchers looking to validate or build upon the findings.
The paper acknowledges several limitations, including the potential for query-family coarsening and the challenges of estimating true Bayes risks due to calibration errors and prompt sensitivity. Additionally, the framework's applicability to different languages, longer audio clips, or varying deployment scenarios is not fully established, which may limit its generalizability.
The proposed framework has significant implications for the deployment of audio language models in real-world applications, particularly in scenarios where audio compression is necessary for efficiency. By emphasizing the importance of answer preservation across different query families, this work could influence future research and development in audio processing, machine learning, and multimodal systems. The findings could lead to improved audio compression techniques that better maintain the integrity of information critical for specific tasks.
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Shanghai Innovation Institute, Tencent, Independent Researcher, Peking University, Tianjin University, Zhejiang University
WavCube presents a novel approach to unify speech understanding and generation through a compact continuous latent representation. This paper makes a substantial contribution to the field by addressing the compatibility challenges between semantic and acoustic features, demonstrating its effectiveness through rigorous experimentation across multiple benchmarks.
The methodology employed in WavCube is innovative, utilizing a two-stage training scheme that effectively addresses the challenges of integrating semantic and acoustic representations. The first stage compresses high-dimensional SSL features into a compact latent space, while the second stage enriches this latent space with fine-grained acoustic details. This approach is well-justified and systematically tackles the inherent flaws of existing SSL representations, making it a significant contribution to the field.
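A minimal sketch of a Stage-2-style objective is shown below: a waveform reconstruction term is combined with a semantic anchoring term that keeps the enriched latent close to the frozen semantic latent from Stage 1. The specific losses and weighting are assumptions for illustration.

```python
# Hedged sketch of a Stage-2-style objective: reconstruction plus a semantic
# anchoring term that keeps the enriched latent near the frozen semantic latent.
# Loss choices and the weighting are illustrative assumptions.
import torch.nn.functional as F

def stage2_loss(wav, wav_recon, latent, semantic_latent, anchor_weight=1.0):
    recon = F.l1_loss(wav_recon, wav)                        # acoustic reconstruction
    anchor = 1.0 - F.cosine_similarity(latent, semantic_latent.detach(), dim=-1).mean()
    return recon + anchor_weight * anchor                    # anchoring keeps the latent on the semantic manifold
```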
The experiments conducted are comprehensive and well-structured, demonstrating WavCube's performance across various tasks, including speech understanding, reconstruction, and generation. The results show that WavCube achieves competitive performance against existing methods, indicating its effectiveness and robustness. The use of benchmarks like SUPERB and the detailed evaluation metrics further enhance the credibility of the findings.
The paper provides sufficient details regarding the methodology and experimental setup, including the datasets and training configurations used. However, the lack of a demo or interactive component may hinder some aspects of reproducibility for practitioners who wish to implement the model.
While the paper presents a strong framework, it does not explicitly discuss potential limitations or assumptions underlying the proposed approach. For instance, the performance drop due to dimensionality reduction and the reliance on specific datasets could be areas of concern that warrant further exploration.
The implications of WavCube are significant, as it offers a unified framework for speech processing that could enhance applications in voice synthesis, speech recognition, and multimodal interactions. By bridging the gap between understanding and generation, WavCube could pave the way for more integrated and efficient speech technologies.
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice-s1 through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice-s2, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
Primary: Zhejiang University
All Institutions: Zhejiang University, Beijing Haitian Ruisheng Science Technology Ltd, Center for Language and Speech Processing, Fudan University, Geely Automobile Research Institute (Ningbo) Company Ltd, MoE Key Lab of Artificial Intelligence, Shanghai Innovation Institute, Shanghai Jiao Tong University, X-LANCE Lab
The paper presents X-Voice, a novel multilingual zero-shot voice cloning model that significantly advances the capabilities of TTS systems across 30 languages. The methodology, which includes a two-stage training process and innovative architectural enhancements, addresses critical limitations in existing systems, making it a valuable contribution to the field of machine learning and audio processing.
The paper introduces a two-stage training paradigm for zero-shot voice cloning, which is a significant advancement in the field. The first stage focuses on building a robust multilingual backbone using a large corpus, while the second stage fine-tunes the model using synthetic audio prompts without the need for reference transcripts. This approach effectively addresses the challenges of multilingual TTS systems, particularly the reliance on aligned text and audio, which is often problematic for low-resource languages. The introduction of dual-level language injection and decoupled classifier-free guidance further enhances the model's ability to maintain speaker identity and prosodic accuracy across languages.
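The decoupling of guidance can be illustrated with a simple sketch in which the text/language branch and the speaker-prompt branch receive separate, independently schedulable guidance scales. The combination rule and the schedules below are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of decoupled classifier-free guidance: the text/language branch
# and the speaker-prompt branch get separate, independently schedulable scales.
# The combination rule and schedules are assumptions for illustration; the
# velocity arguments can be numpy arrays or torch tensors.
def decoupled_cfg(v_uncond, v_text, v_spk, step, n_steps, s_text=2.0, s_spk=1.0):
    # Example schedule: ramp text guidance down and speaker guidance up over sampling.
    frac = step / max(n_steps - 1, 1)
    w_text = s_text * (1.0 - 0.5 * frac)
    w_spk = s_spk * (0.5 + 0.5 * frac)
    return v_uncond + w_text * (v_text - v_uncond) + w_spk * (v_spk - v_uncond)
```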
The experimental results are comprehensive, comparing X-Voice against several state-of-the-art models across multiple languages. The use of both subjective and objective evaluation metrics, including WER, SIM-o, IMOS, and SMOS, provides a well-rounded assessment of the model's performance. The results indicate that X-Voice achieves competitive performance, particularly in low-resource languages, while also demonstrating improvements in intelligibility and speaker consistency compared to existing systems. The release of a new evaluation benchmark with human annotations adds significant value to the research community.
The paper provides detailed implementation details, including model configurations, training setups, and evaluation protocols, which enhances reproducibility. The authors have also open-sourced their training corpus and evaluation benchmarks, fostering transparency and allowing other researchers to build upon their work.
Despite its strengths, the model still faces challenges in preserving speaker similarity in certain phonological contexts, indicating a trade-off between accent suppression and timbre preservation. Additionally, the handling of intra-sentential code-switching is noted as an area for future improvement. The reliance on high-quality synthetic data in the fine-tuning stage may also limit the model's applicability in scenarios where such data is not available.
The advancements presented in this paper have the potential to democratize high-fidelity TTS technology, making it accessible for a wider range of languages, including low-resource ones. The implications extend to various applications, such as personalized voice assistants, language learning tools, and accessibility technologies for individuals with speech impairments. The open-sourcing of resources could significantly accelerate research in multilingual TTS systems and contribute to the development of more inclusive technologies.
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
Primary: Potomac Quantum
All Institutions: Potomac Quantum, United International University, University of Maryland, Monash University, University of the Sunshine Coast
The paper presents Q-Patch, a quantum feature-mapping framework for audio spoofing detection that effectively utilizes time-frequency structures in spectrograms. This innovative approach, combined with rigorous experimental validation, positions the work as a meaningful contribution to the field of audio deepfake detection and quantum machine learning.
The methodology introduces Q-Patch, a novel quantum feature mapping framework specifically designed for audio spoofing detection. It effectively utilizes local time-frequency patches from mel-spectrograms, which is a significant improvement over treating spectrograms as generic images. The use of shallow, hardware-efficient quantum circuits with adjacency-aware entanglement is innovative, as it addresses practical constraints of near-term quantum computing. The approach to summarize patches into compact four-dimensional descriptors before quantum embedding is well thought out, allowing for efficient processing while maintaining relevant information.
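One plausible realization of the patch summarization step is sketched below: a four-dimensional descriptor computed from a log-mel patch and rescaled into rotation angles for angle encoding on four qubits. The choice of statistics and the normalization are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch: a four-dimensional descriptor for a mel-spectrogram patch and
# its mapping to rotation angles for angle encoding on four qubits. The choice
# of statistics and the normalization are assumptions, not the paper's recipe.
import numpy as np

def patch_descriptor(patch):
    """patch: (n_mels, n_frames) slice of a log-mel spectrogram."""
    energy = patch.mean()
    spread = patch.std()
    freq_idx = np.arange(patch.shape[0])
    weights = np.exp(patch)                                   # back to linear magnitude
    centroid = (freq_idx[:, None] * weights).sum() / weights.sum()
    flux = np.abs(np.diff(patch, axis=1)).mean()              # coarse temporal variation
    return np.array([energy, spread, centroid, flux])

def to_rotation_angles(desc, lo, hi):
    """Scale each descriptor dimension into [0, pi] for RY angle encoding."""
    return np.pi * (desc - lo) / (hi - lo + 1e-8)
```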
The experimental evaluation is conducted on a balanced dataset derived from LJ Speech, which includes both bona fide and spoofed audio samples. The results indicate that Q-Patch outperforms classical baselines, achieving an AUROC of 0.87 compared to 0.82 for RBF-SVM. The analysis of kernel-space structure further supports the effectiveness of the proposed method, showing clear class separability. However, the limited dataset size (100 samples) raises concerns about the generalizability of the results, which should be addressed in future work.
The paper provides a detailed description of the methodology, including data preparation, feature extraction, and quantum embedding processes. However, the absence of code or a project URL limits the reproducibility of the results. Future work should include sharing the implementation details or code to enable other researchers to replicate the findings.
The study's limitations include the small dataset size, which may not capture the full diversity of real-world audio spoofing attacks. The controlled nature of the spoof generation (using additive noise and spectral distortions) may not reflect the complexities of actual spoofing methods. Additionally, the results are based on ideal quantum simulations, which may not translate directly to performance on physical quantum hardware.
The proposed Q-Patch framework has the potential to significantly impact the field of audio deepfake detection by introducing quantum machine learning techniques that leverage time-frequency structures. This could lead to more robust detection methods that are particularly useful in low-resource settings. As quantum computing technology advances, the framework may become increasingly applicable in real-world scenarios, enhancing security against audio spoofing.
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs - including members from the Gemini and GPT families - across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audio-native and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions regarding latency, cost, and reasoning depth.
Primary: Google
All Institutions: Google USA & Germany
The main contribution of this paper is the rigorous empirical evaluation of leading audio-native LLMs on the MSEB, providing valuable insights into their performance and the challenges of achieving audio-text parity. The comprehensive analysis of methodologies and results positions this work as a significant step forward in the integration of audio processing within the framework of large language models, addressing both theoretical and practical aspects of the field.
The paper presents a comprehensive methodology for applying large language models (LLMs) to the Massive Sound Embedding Benchmark (MSEB), detailing a systematic approach to task-specific prompting and evaluation across diverse audio tasks. The methodology is well-structured, with clear definitions of tasks, input/output formats, and considerations for model performance. The iterative refinement of prompt templates through interactions with models like Gemini 3 demonstrates a thoughtful approach to optimizing LLMs for audio tasks. However, the paper could benefit from a more detailed discussion of the limitations of the chosen methodologies, particularly regarding the adaptability of LLMs to non-generative tasks.
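A task-specific prompting loop for a multiple-choice audio benchmark can be sketched as building a prompt per example, querying the model, and parsing a letter from the response. The prompt wording, parsing rule, and error handling below are assumptions and not MSEB's actual protocol.

```python
# Hedged sketch of a multiple-choice prompting loop: build a prompt per example
# and parse a letter from the model response. The prompt wording and parsing
# rule are assumptions, not MSEB's actual protocol.
import re

def build_prompt(question, options):
    lines = ["Listen to the audio clip and answer the question.", f"Question: {question}"]
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def parse_choice(response, n_options):
    match = re.search(r"\b([A-Z])\b", response.strip().upper())
    if match and ord(match.group(1)) - 65 < n_options:
        return ord(match.group(1)) - 65
    return None   # treat unparseable outputs as errors
```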
The experimental evaluation is robust, covering a wide range of models and tasks, with detailed performance metrics provided for each evaluation. The use of a diverse set of datasets, including multilingual and varied acoustic environments, enhances the reliability of the results. The paper effectively compares audio-native LLMs with traditional cascaded systems, providing insights into their relative strengths and weaknesses. However, the analysis of results could be improved by including more visual aids (e.g., graphs) to illustrate performance trends across tasks and models.
The paper mentions the open-source nature of the MSEB toolkit and provides a link to the GitHub repository, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings, training protocols, and the exact versions of models used, which could hinder full reproducibility for other researchers.
The paper acknowledges the significant modality gap that persists between audio and text processing, which is a critical limitation. Additionally, the authors note the challenges in achieving consistent performance across different locales and acoustic conditions, indicating that the models may not generalize well in real-world applications. The potential for test data contamination is also a significant concern that could skew results.
The findings of this research have significant implications for the development of audio processing systems, particularly in enhancing the capabilities of LLMs in understanding and reasoning about audio data. The establishment of the MSEB as a benchmark could drive further research and innovation in the field, promoting the development of more robust and versatile audio-native models. The open-source nature of the toolkit encourages community engagement and collaboration, which could accelerate advancements in auditory intelligence.
This study presents a bio-inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). State-of-the-art methods often fail to resolve dense low-frequency harmonic structures in vessel propulsion signals under high-noise conditions; the proposed framework addresses this with a biologically inspired Gammatone filter bank that emulates the cochlea's nonlinear frequency selectivity. By distributing filters according to the Equivalent Rectangular Bandwidth (ERB) scale, the framework achieves a high-fidelity representation of engine-radiated tonals while effectively suppressing isotropic ambient interference. The resulting Cochleagram features are processed by a lightweight, custom-designed Convolutional Neural Network (CNN) that leverages large receptive fields to integrate spectral-temporal continuities. Experimental results on the VTUAD dataset demonstrate a state-of-the-art classification accuracy of 98.41%, outperforming Continuous Wavelet Transform and Mel Frequency Cepstral Coefficients baselines by 3.5% and 7.7%, respectively. Furthermore, the framework achieves an inference latency of only 0.77 ms and a 0.971 Cohen's Kappa score, validating its efficacy for real-time deployment on autonomous, low-power sonar hardware.
Primary: Centre for Applied Research in Electronics (CARE)
All Institutions: Central Research Laboratory, Bharat Electronics Limited, Ghaziabad, India, Centre for Applied Research in Electronics (CARE), IIT Delhi, India
The main contribution of this paper is the development of a bio-inspired Gammatone-CNN framework for underwater acoustic target classification, achieving state-of-the-art performance through innovative feature extraction techniques. This research significantly advances the field of underwater acoustics by providing a method that combines biological principles with modern machine learning, demonstrating the potential for improved classification accuracy in challenging acoustic environments.
The paper introduces a novel bio-inspired Gammatone filter bank for underwater acoustic target classification, leveraging the non-linear frequency selectivity of the cochlea. The methodology prioritizes feature extraction over architectural complexity, employing a lightweight CNN that effectively integrates spectral-temporal features. The use of the Equivalent Rectangular Bandwidth (ERB) scale for filter distribution is particularly innovative, allowing for high fidelity in low-frequency representation, which is crucial for underwater acoustics. The mathematical foundations of the Gammatone filter and the detailed description of the Cochleagram formation process provide a solid basis for the proposed approach.
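The ERB spacing and gammatone impulse response follow standard auditory-modeling formulas (Glasberg and Moore); a minimal sketch is given below, where the filter order, bandwidth constant, and sample rate are conventional defaults that may differ from the paper's configuration.

```python
# Sketch of ERB-spaced center frequencies and a gammatone impulse response.
# Formulas follow the standard Glasberg & Moore ERB model; filter order,
# bandwidth constant, and sample rate are conventional defaults and may differ
# from the paper's exact configuration.
import numpy as np

def erb_center_freqs(f_low, f_high, n_filters):
    # ERB-number (Cams) scale and its inverse.
    to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb(np.linspace(to_erb(f_low), to_erb(f_high), n_filters))

def gammatone_ir(fc, fs=16000, duration=0.064, order=4, b=1.019):
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)          # equivalent rectangular bandwidth in Hz
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t) * np.cos(2 * np.pi * fc * t)

# Cochleagram sketch: filter the signal with each impulse response, then take
# the per-band envelope magnitude (framing and averaging omitted here).
```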
The experimental validation is robust, utilizing the VTUAD dataset to demonstrate the framework's effectiveness. Achieving a classification accuracy of 98.41% and a Cohen Kappa score of 0.971 indicates strong performance and reliability. The comparative analysis against established methods like Continuous Wavelet Transform and Mel Frequency Cepstral Coefficients shows significant improvements, reinforcing the proposed method's superiority. The inclusion of diverse metrics such as ROC curves and confusion matrices adds depth to the evaluation.
The paper provides detailed information on the experimental setup, including dataset partitioning, feature extraction parameters, and model architecture. However, the absence of a publicly available code repository limits reproducibility. Future work should consider sharing implementation details to facilitate validation by other researchers.
While the proposed framework shows remarkable performance, it may struggle with class imbalance, particularly for underrepresented classes like Passengership. The reliance on a specific dataset (VTUAD) may also limit generalizability to other underwater environments. Additionally, the computational efficiency on standard CPUs, while acceptable, could be a concern for real-time applications in more constrained environments.
The implications of this research extend to maritime security, ecological monitoring, and autonomous underwater vehicles (AUVs). By improving underwater target recognition, the framework can enhance surveillance capabilities and contribute to the protection of marine ecosystems. The low-power, real-time processing capabilities make it suitable for deployment in resource-constrained environments.
The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction following data preparation pipeline, incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without the need for task-specific retraining.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Laboratory of Artificial Intelligence, AI Institute
The main contribution of this paper is the introduction of JASTIN, a novel instruction-driven framework for zero-shot audio evaluation that significantly enhances the evaluation process by integrating multimodal LLMs with advanced audio processing techniques. This work represents a meaningful advancement in the field of audio evaluation, addressing critical challenges and setting a new standard for future research.
The proposed JASTIN framework innovatively integrates a frozen high-performance audio encoder with a fine-tuned LLM backbone through a trainable audio adapter, addressing the limitations of existing evaluation metrics by employing a self-instructed reasoning paradigm. The comprehensive data preparation pipeline, which includes multi-source, multi-task, multi-calibration, and multi-description strategies, enhances the model's zero-shot generalization capabilities, making it adaptable to various audio evaluation tasks without the need for task-specific retraining.
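The bridging pattern can be sketched as a small trainable adapter that projects frozen audio-encoder features into the LLM's token-embedding space. The adapter architecture, downsampling factor, and dimensions below are assumptions for illustration.

```python
# Hedged sketch of the bridging pattern: frozen audio-encoder features are
# projected by a small trainable adapter into the LLM's token-embedding space.
# The adapter architecture and dimensions are assumptions for illustration.
import torch.nn as nn

class AudioAdapter(nn.Module):
    def __init__(self, audio_dim=1024, llm_dim=4096, downsample=4):
        super().__init__()
        self.downsample = downsample               # stack frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                      # feats: (B, T, audio_dim), encoder kept frozen
        B, T, D = feats.shape
        T = (T // self.downsample) * self.downsample
        feats = feats[:, :T].reshape(B, T // self.downsample, D * self.downsample)
        return self.proj(feats)                    # (B, T', llm_dim) audio "tokens" for the LLM
```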
The experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings across diverse audio domains, including speech, sound, and music. The framework consistently outperforms both traditional metrics and general MLLMs, showcasing its robustness and effectiveness in real-world applications. The evaluation on out-of-domain tasks further emphasizes its generalization capabilities, which is a significant advancement in the field.
The authors have provided detailed implementation information, including training configurations, data preparation methods, and evaluation metrics, which enhances the reproducibility of their results. However, the lack of a demo URL limits immediate accessibility for other researchers to test the framework.
While the paper presents a comprehensive framework, it does not address potential biases in the training data or the limitations of the LLMs used. Additionally, the model's performance on highly specialized audio tasks may still require further validation.
The JASTIN framework has the potential to revolutionize audio evaluation methodologies by providing a more flexible and generalizable approach. Its implications extend to various applications in audio synthesis, music generation, and speech processing, enabling more efficient and scalable evaluation processes in these domains.
While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, a neural network designed to generate virtual microphone (VM) signals from a limited set of real microphone (RM) measurements. Moreover, we introduce the Spatial Audio Representation Learning (SARL) framework, which leverages estimated VM signals and features to condition a downstream speech enhancement system. Experimental results demonstrate that the proposed framework outperforms existing spatial upsampling baselines across various speech extraction systems, including end-to-end multichannel speech enhancement and neural beamforming. The proposed method nearly recovers the oracle performance achieved when all microphones are available.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST), Meta Reality Labs Research
The paper introduces Spatial-Magnifier, a neural network for spatial upsampling in multichannel speech enhancement, and the SARL framework, significantly enhancing downstream speech processing tasks. The innovative approach and rigorous experimental validation position this work as a valuable contribution to the field of audio signal processing and machine learning.
The paper presents a novel neural network architecture, Spatial-Magnifier, which effectively generates virtual microphone signals from real microphone measurements. It introduces the Spatial Audio Representation Learning (SARL) framework, which enhances the conditioning of downstream speech enhancement tasks by leveraging both estimated virtual microphone signals and features. The use of a GAN-based approach and the incorporation of selection and dynamic channel allocation modules are innovative aspects that contribute to the flexibility and efficiency of the model. The methodology is well-structured, with clear definitions and a logical flow from problem identification to proposed solutions.
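The conditioning pattern can be sketched as stacking estimated virtual-microphone channels with the real-microphone channels before the downstream enhancer. The upsampler and enhancer below are placeholders; the architecture details are assumptions rather than the paper's modules.

```python
# Hedged sketch of the conditioning pattern: estimated virtual-microphone (VM)
# signals are stacked with the real-microphone (RM) channels before the
# downstream enhancer. The upsampler is a placeholder; architecture details
# are assumptions, not the paper's modules.
import torch
import torch.nn as nn

class VirtualMicUpsampler(nn.Module):
    def __init__(self, n_real=2, n_virtual=4):
        super().__init__()
        self.net = nn.Conv1d(n_real, n_virtual, kernel_size=9, padding=4)  # placeholder generator

    def forward(self, rm):                           # rm: (B, n_real, T)
        return self.net(rm)                          # (B, n_virtual, T) estimated VM signals

def condition_enhancer(enhancer, rm, upsampler):
    vm = upsampler(rm)
    return enhancer(torch.cat([rm, vm], dim=1))      # enhancer consumes RM + estimated VM channels
```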
The experiments are comprehensive, utilizing a well-defined dataset and a robust experimental setup to evaluate the performance of the proposed methods. The authors conduct ablation studies and comparisons with existing baselines, demonstrating the effectiveness of their approach across different configurations and tasks. The results indicate significant improvements in performance metrics such as SI-SDR, SNR, PESQ, and STOI, showcasing the technical superiority of the proposed methods over traditional approaches.
The paper provides sufficient details regarding the experimental setup, including the architecture parameters, training procedures, and evaluation metrics. However, the absence of a publicly accessible code repository or demo URL limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate validation by the research community.
One limitation is the reliance on simulated data for training and evaluation, which may not fully capture the complexities of real-world environments. Additionally, while the proposed methods show promise, the performance in highly dynamic or noisy environments remains to be thoroughly evaluated. The computational efficiency, while improved, could still be a concern for deployment on resource-constrained devices.
The proposed methods have significant implications for real-world applications in speech enhancement, particularly in consumer electronics such as AR glasses and hearing aids. By enabling effective multichannel speech enhancement with fewer microphones, the work addresses a critical need for improved audio capture in compact devices. The advancements in spatial audio processing could also benefit various fields, including telecommunications, virtual reality, and assistive technologies.
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resolution (DCR), a unified framework that learns when to fuse and when to drop modalities. Path I (Affective Fusion Distiller, AFD) performs reverse distillation from audio/visual teachers to a textual student using temporally weighted class evidence, thereby enhancing representation-level calibration and improving fusion when alignment is beneficial. Path II (Affective Discernment Agent, ADA) formulates MER as a contextual bandit that selects among fusion and unimodal predictions based on a dual-view state and a calibration-aware reward, enabling decision-level arbitration under irreconcilable conflicts without requiring per-modality reliability labels. By taking into account the full multimodal context and coupling soft calibration with hard arbitration, DCR reconciles conflicts that can be aligned while bypassing misleading modalities when fusion is harmful. Across five benchmarks covering both dialogue-level and clip-level MER, DCR consistently outperforms competitive baselines or achieves highly competitive results. Further ablations, conflict-specific subset evaluation, and modality-selection analysis verify that AFD and ADA are complementary and jointly improve robust conflict-aware emotion recognition.
Primary: Hefei University of Technology
All Institutions: Hefei University of Technology, Singapore Management University, Nanyang Technological University, MIT Media Lab
The main contribution of this paper is the introduction of the Dual-Path Conflict Resolution framework, which innovatively addresses modality conflicts in multimodal emotion recognition by employing a dual-path approach that distinguishes between benign and severe conflicts. This comprehensive analysis highlights the technical contributions, methodological rigor, and potential impact of the research on the field of affective computing.
The proposed Dual-Path Conflict Resolution (DCR) framework is a significant advancement in multimodal emotion recognition (MER). It effectively distinguishes between benign and severe modality conflicts, employing two distinct paths (AFD and ADA) to handle these conflicts appropriately. AFD utilizes knowledge distillation to enhance textual representations with non-verbal cues, while ADA employs a contextual bandit approach for decision-level arbitration. This dual-path strategy is innovative as it shifts the focus from traditional fusion methods that may amplify errors to a more nuanced conflict-aware approach. The methodology is well-structured, with clear definitions of conflict types and a comprehensive explanation of how each path operates.
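Decision-level arbitration via a contextual bandit can be sketched with a LinUCB-style policy that selects among candidate predictions (fused and unimodal) given a state vector and updates per-arm statistics from observed rewards. The state features, reward, and exploration constant are assumptions, and the paper's calibration-aware reward is not reproduced.

```python
# Hedged sketch: LinUCB-style arbitration among candidate predictions (fused
# and unimodal). State features, reward, and the exploration constant are
# assumptions; the paper's calibration-aware reward is not reproduced.
import numpy as np

class LinUCBArbiter:
    def __init__(self, arms, state_dim, alpha=0.5):
        self.arms = arms                                # e.g. ["fusion", "text", "audio", "vision"]
        self.alpha = alpha
        self.A = {a: np.eye(state_dim) for a in arms}   # per-arm design matrices
        self.b = {a: np.zeros(state_dim) for a in arms}

    def select(self, state):
        scores = {}
        for a in self.arms:
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            scores[a] = theta @ state + self.alpha * np.sqrt(state @ A_inv @ state)
        return max(scores, key=scores.get)              # arm with the highest upper confidence bound

    def update(self, arm, state, reward):
        self.A[arm] += np.outer(state, state)
        self.b[arm] += reward * state
```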
The experimental evaluation is robust, covering five diverse benchmarks that include both dialogue-level and clip-level datasets. The results consistently demonstrate that DCR outperforms competitive baselines, indicating its effectiveness across different contexts. The paper includes detailed ablation studies that validate the contributions of each component within the DCR framework, further strengthening the findings. The use of multiple evaluation metrics enhances the reliability of the results.
The paper provides sufficient implementation details, including the architecture, training protocols, and datasets used. However, the absence of a public demo or detailed code repository at the time of review limits reproducibility. The authors mention that the source code and models will be released, which is a positive step towards enhancing reproducibility.
One limitation of the study is the reliance on heuristic approximations for defining conflict severity, which may not capture the full complexity of modality interactions in real-world scenarios. Additionally, while the framework shows strong performance, its effectiveness in highly nuanced or ambiguous emotional contexts remains to be fully explored.
The DCR framework has significant implications for various applications, including human-computer interaction, healthcare, and robotics, where accurate emotion recognition is crucial. By addressing modality conflicts more effectively, this work could lead to more reliable affective computing systems that better understand human emotions.
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we introduce a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage-aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text-conditioned audio generation and audio-conditioned super-resolution. Across both settings, the proposed stage-aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage-dependent components rather than fixed ingredients.
Primary: China Pharmaceutical University
All Institutions: China Pharmaceutical University, University of Science and Technology of China
The paper presents a novel stage-adaptive framework for audio diffusion modeling, significantly enhancing training efficiency and model performance. The comprehensive methodology and experimental validation contribute valuable insights to the field, although concerns regarding reproducibility and the need for broader applicability remain.
The paper introduces a stage-aware perspective on audio diffusion training, which is a significant methodological innovation. The authors propose three complementary mechanisms—decayed SSL guidance, self-adaptive timestep sampling, and structure-aware regularization—each designed to adapt the training process based on the evolving needs of the model. This approach is well-justified and supported by a clear theoretical framework, utilizing a regime variable to monitor semantic progress. The proposed methods are distinct from traditional static optimization techniques, marking a notable advancement in the field of audio diffusion modeling.
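The regime variable and decayed guidance can be sketched as follows: track an exponential moving average of the SSL-space discrepancy, read off its smoothed slope as a proxy for semantic progress, and scale the external guidance weight accordingly. The smoothing, normalization, and decay rule below are assumptions for illustration.

```python
# Hedged sketch: a progress-based regime variable derived from the smoothed
# slope of an SSL-space discrepancy, used to decay external semantic guidance.
# Smoothing, normalization, and the decay rule are assumptions for illustration.
import numpy as np

class StageScheduler:
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.ema = None
        self.prev_ema = None

    def update(self, ssl_discrepancy):
        self.prev_ema, self.ema = self.ema, (
            ssl_discrepancy if self.ema is None
            else self.momentum * self.ema + (1 - self.momentum) * ssl_discrepancy
        )
        slope = 0.0 if self.prev_ema is None else self.ema - self.prev_ema
        # Regime variable in [0, 1]: ~1 while semantic error still drops quickly,
        # ~0 once it plateaus (scaling factor is an arbitrary choice here).
        return float(np.clip(-slope / (abs(self.ema) + 1e-8) * 100.0, 0.0, 1.0))

def guidance_weight(regime, w_max=1.0):
    return w_max * regime            # strong SSL guidance early, decayed as progress saturates
```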
The experiments are comprehensive, evaluating the proposed methods on both text-conditioned audio generation and audio-conditioned super-resolution. The use of multiple metrics (e.g., FAD, KL divergence, and spectral reconstruction metrics) provides a robust assessment of performance improvements. The results consistently demonstrate that the stage-aware mechanisms outperform static baselines, highlighting their effectiveness. However, the paper could benefit from additional experiments to further validate the findings across diverse datasets and conditions.
The paper lacks explicit details regarding the implementation and availability of code or datasets, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a project URL or demo limits the ability of other researchers to replicate the results or build upon the work.
One limitation is the reliance on a single frozen SSL encoder, which may restrict the generalizability of the findings. Additionally, while the results show improvements in convergence and quality metrics, the paper does not sufficiently address the computational overhead introduced by the proposed mechanisms. The authors also acknowledge that gains in certain metrics (e.g., SISNR) were less pronounced, suggesting that the approach may not uniformly enhance all aspects of audio quality.
The findings have significant implications for the development of efficient audio generation systems, particularly in applications requiring high-quality audio synthesis and restoration. By demonstrating that training efficiency can be improved through a stage-aware approach, this work may influence future research directions in generative modeling and audio processing. The insights gained could also be applicable to other domains where dynamic adaptation of training strategies is beneficial.
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at https://github.com/pymaster17/VocalParse.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Nanyang Technological University, Tianjin University, Ant Group, Zhejiang University
The main contribution of this paper is the development of VocalParse, a unified and scalable singing voice transcription framework that effectively integrates lyrics and melody transcription using advanced prompting strategies and a novel data collection pipeline. This work represents a significant step forward in addressing the challenges of automatic singing voice transcription, with implications for both academic research and practical applications in music technology.
The paper introduces VocalParse, a unified singing voice transcription model leveraging a Large Audio Language Model (LALM). The methodology is innovative, particularly with the interleaved prompting formulation that integrates lyrics and melody in a structured manner, addressing the challenges of traditional multi-stage pipelines. The Chain-of-Thought (CoT) prompting strategy is a significant advancement, allowing for better semantic continuity in the transcription process. The introduction of the SingCrawl data pipeline for large-scale data collection is also a noteworthy contribution, enhancing the model's training data quality and quantity.
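To make the interleaved formulation concrete, the sketch below serializes lyrics first (the CoT-style semantic scaffold) and then interleaves each word with its aligned note events. The control tokens (`<lyrics>`, `<score>`, `<note ...>`), the note fields, and the overall layout are illustrative assumptions, not the released VocalParse serialization.

```python
# Hypothetical sketch of a CoT-style interleaved target sequence: lyrics first
# as a semantic scaffold, then word/note pairs that encode word-note alignment.
# All token names and field formats are assumptions for illustration.

def build_cot_interleaved_target(words, notes):
    """words: list of lyric words; notes: per-word lists of (pitch, onset_s, dur_s)."""
    # Stage 1: plain lyric transcription acts as the reasoning scaffold.
    target = ["<lyrics>"] + list(words)

    # Stage 2: interleave each word with its aligned note events so the
    # generated sequence maps directly to a structured musical score.
    target.append("<score>")
    for word, word_notes in zip(words, notes):
        target.append(word)
        for pitch, onset, dur in word_notes:
            target.append(f"<note pitch={pitch} onset={onset:.2f} dur={dur:.2f}>")
    return " ".join(target)


if __name__ == "__main__":
    words = ["twinkle", "twinkle"]
    notes = [[(60, 0.00, 0.45)], [(60, 0.50, 0.45)]]
    print(build_cot_interleaved_target(words, notes))
```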
The experiments demonstrate VocalParse's state-of-the-art performance across multiple datasets, showcasing its effectiveness in both Automatic Melody Transcription (AMT) and Automatic Lyric Transcription (ALT). The results are robust, with clear metrics provided for evaluation, including Mean Absolute Error (MAE) for melody and Word Error Rate (WER) for lyrics. The ablation studies effectively highlight the importance of the CoT prompting and the SingCrawl pipeline, providing insights into the model's performance drivers.
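For reference, the two headline metrics can be computed as in the minimal sketch below, assuming word-level tokenization for WER and aligned per-note values for MAE; the paper's exact alignment and normalization choices may differ.

```python
# Reference implementations of the reported metrics under simple assumptions.
import numpy as np

def word_error_rate(ref_words, hyp_words):
    """Levenshtein edit distance over words, normalized by reference length."""
    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref_words) + 1)
    d[0, :] = np.arange(len(hyp_words) + 1)
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1, j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(ref_words), 1)

def melody_mae(ref_values, hyp_values):
    """Mean absolute error over aligned note attributes (e.g., pitch or onset)."""
    ref, hyp = np.asarray(ref_values, float), np.asarray(hyp_values, float)
    return float(np.mean(np.abs(ref - hyp)))
```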
The paper provides sufficient implementation details, including training configurations and data processing steps. The availability of source code and checkpoints on GitHub enhances reproducibility, although the lack of a demo or interactive component may limit accessibility for some researchers.
The paper acknowledges limitations such as the assumption of a single global tempo for songs, which may not capture variations in performance. Additionally, the model's performance is constrained by the quality of the teacher pipeline used for data annotation. The focus on Mandarin data may also limit generalizability to other languages without further adaptation.
VocalParse has the potential to significantly impact the field of music information retrieval (MIR) and singing voice synthesis (SVS) by providing a scalable solution for automatic singing voice transcription. This could lead to advancements in music generation, annotation, and analysis, facilitating broader applications in music technology and AI-driven creative processes.
Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. A key yet unexplored factor in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio. APEX jointly predicts engagement-based popularity signals (streams and likes scores) alongside five perceptual aesthetic quality dimensions, using frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.
Primary: Singapore University of Technology and Design
All Institutions: Singapore University of Technology and Design
The main contribution of this paper is the introduction of APEX, a multi-task learning framework that predicts both popularity and aesthetic quality in AI-generated music, demonstrating the complementary relationship between these dimensions and providing a foundation for future research in music recommendation systems. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the innovative approach taken and its potential impact on the landscape of music technology.
The proposed APEX framework is a significant advancement in the field of music popularity prediction, particularly for AI-generated music. It employs a multi-task learning approach that integrates aesthetic quality dimensions with engagement-based popularity signals, which is a novel combination in this domain. The use of MERT embeddings for audio representation is well-justified, and the systematic exploration of loss strategies and task configurations demonstrates a rigorous approach to model design. The methodology is comprehensive, addressing both the technical aspects of model training and the theoretical underpinnings of the relationship between aesthetic quality and popularity.
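A minimal sketch of this setup is shown below: frozen MERT embeddings feed a small shared trunk with separate regression heads for streams, likes, and the five aesthetic dimensions. The layer sizes, loss weighting, and head structure are assumptions for illustration, not the released APEX architecture.

```python
# Multi-task regression heads over frozen MERT embeddings (illustrative only).
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, embed_dim=1024, hidden=256, n_aesthetic=5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU())
        self.streams = nn.Linear(hidden, 1)                # engagement: streams score
        self.likes = nn.Linear(hidden, 1)                  # engagement: likes score
        self.aesthetics = nn.Linear(hidden, n_aesthetic)   # perceptual quality dims

    def forward(self, mert_embedding):                     # (batch, embed_dim)
        h = self.trunk(mert_embedding)
        return self.streams(h), self.likes(h), self.aesthetics(h)

def multitask_loss(preds, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task regression losses; weights are an assumption."""
    mse = nn.functional.mse_loss
    s, l, a = preds
    return (weights[0] * mse(s, targets["streams"])
            + weights[1] * mse(l, targets["likes"])
            + weights[2] * mse(a, targets["aesthetics"]))
```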
The experiments are robust, utilizing a large-scale dataset of over 211k songs and including a thorough ablation study across 24 experimental conditions. The evaluation on the Music Arena dataset with unseen generative music systems adds significant value, demonstrating the model's generalization capabilities. The results indicate that aesthetic features enhance preference prediction, which is a critical insight for future research in music recommendation systems.
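One plausible way to turn such per-song predictions into pairwise preference probabilities, offered here as an assumption rather than the paper's documented protocol, is a Bradley-Terry-style logistic link on a weighted score difference:

```python
# Pairwise preference from per-song predicted signals (illustrative assumption).
import torch

def preference_probability(scores_a, scores_b, weights):
    """scores_*: (n_tasks,) predicted signals for each song; weights: (n_tasks,)."""
    s_a = torch.dot(weights, scores_a)
    s_b = torch.dot(weights, scores_b)
    return torch.sigmoid(s_a - s_b)   # P(song A preferred over song B)
```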
The paper provides detailed implementation details, including dataset construction, embedding extraction, training procedures, and model architectures. The open-source release of the APEX model and its code on GitHub further supports reproducibility, allowing other researchers to validate and build upon this work.
One limitation is the potential bias introduced by the dataset, as it primarily focuses on AI-generated music from specific platforms (Udio and Suno), which may not generalize to all music genres or styles. Additionally, the performance gap on vocal tracks suggests that further refinement is needed for models that incorporate vocal elements, which could be explored in future work.
The implications of this research are substantial, as it addresses the growing domain of AI-generated music and its integration into popular music consumption. By providing a framework that predicts popularity based on intrinsic audio features and aesthetic quality, it opens avenues for improved music recommendation systems and enhances the understanding of how listeners perceive AI-generated music. This work could influence artists, music platforms, and researchers alike, fostering a deeper appreciation for the aesthetic dimensions of music.
Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. We find that bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09). Their separation aligns closely with spectral distribution distance, a gradient consistent with the acoustic niche hypothesis. This geometry makes simple averaging optimal while sign-conflict methods reduce accuracy by one to six percentage points. Composition also creates an asymmetric gap: species-rich groups lose accuracy relative to joint training while underrepresented taxa gain, a redistribution useful for equitable biodiversity monitoring. We verify linear mode connectivity across all taxonomic pairs, demonstrate zero-shot transfer to new regions, and identify domain negation as a boundary condition where composition fails. These results enable a collaborative paradigm for bioacoustics where institutions share only task vectors to assemble multi-taxa classifiers, preserving data privacy.
Primary: Institute of Science Tokyo
All Institutions: Institute of Science Tokyo, RIKEN BDR
This paper presents a significant advancement in the field of bioacoustics by introducing a novel method for composing multi-taxa classifiers using task vector arithmetic, enabling collaborative model building while preserving data privacy. The combination of ecological principles with machine learning techniques offers a fresh perspective on model merging, with the potential to impact biodiversity monitoring and conservation efforts.
The methodology presented in this paper is innovative, leveraging task vector arithmetic to combine independently fine-tuned models into a unified classifier without sharing data. The authors provide a clear and systematic approach to the problem, including the definition of task vectors and the exploration of their geometric properties in weight space. The use of ecological principles, specifically the acoustic niche hypothesis, to predict the geometry of task vectors is a novel angle that adds depth to the analysis. The methodology is well-structured, with detailed descriptions of the merging strategies and the assumptions made during the process.
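The core arithmetic follows the standard task-vector recipe; the sketch below computes per-institution task vectors from state dicts, measures their cosine similarity, and merges them by simple averaging, which the paper reports as optimal for near-orthogonal vectors. The state-dict handling is simplified relative to working with full BEATs checkpoints.

```python
# Task-vector composition: tau_i = theta_finetuned_i - theta_pretrained,
# merged = theta_pretrained + mean(tau_i). Simplified, illustrative handling.
import torch

def task_vector(pretrained, finetuned):
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def cosine(tv_a, tv_b):
    """Cosine similarity between two task vectors, flattened across all parameters."""
    a = torch.cat([v.flatten() for v in tv_a.values()])
    b = torch.cat([v.flatten() for v in tv_b.values()])
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

def merge_by_averaging(pretrained, task_vectors, scale=1.0):
    """Simple averaging of task vectors on top of the shared pretrained encoder."""
    merged = {}
    for k in pretrained:
        mean_tau = torch.stack([tv[k] for tv in task_vectors]).mean(dim=0)
        merged[k] = pretrained[k] + scale * mean_tau
    return merged
```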
The experimental evaluation is robust, with multiple experiments validating the proposed approach. The authors demonstrate linear mode connectivity, analyze task vector geometry, and assess the impact of merging on species classification accuracy. The results are comprehensive, showing both the advantages and limitations of the proposed method. The use of various datasets and the exploration of zero-shot transfer capabilities further strengthen the findings. However, the paper could benefit from more extensive comparisons with existing methods to highlight the advantages of the proposed approach.
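A linear mode connectivity check can be sketched as follows, assuming two fine-tuned state dicts and a user-supplied evaluation closure; the interpolation grid and evaluation protocol here are assumptions, not the paper's exact setup.

```python
# Linear mode connectivity: interpolate between two checkpoints and evaluate.
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    # Interpolate floating-point tensors; copy non-float buffers unchanged.
    return {k: ((1 - alpha) * v + alpha * sd_b[k]) if v.is_floating_point() else v
            for k, v in sd_a.items()}

def connectivity_curve(model, sd_a, sd_b, eval_loss_fn, steps=11):
    """Return evaluation losses along the linear path between two checkpoints."""
    losses = []
    for alpha in torch.linspace(0, 1, steps):
        model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha.item()))
        losses.append(eval_loss_fn(model))   # user-supplied evaluation closure
    return losses
```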
The paper provides a thorough account of the experimental setup, including hyperparameters and training protocols, which aids reproducibility. The authors mention using SHA-256 hashes for configuration validation, ensuring that the experiments can be replicated. However, the lack of a detailed description of the datasets used and their preprocessing steps may pose challenges for complete reproducibility.
One limitation of the study is the potential for overfitting to the specific datasets used, which may not generalize to other bioacoustic contexts. Additionally, the assumption of disjoint species sets may not hold in all real-world scenarios, potentially affecting the performance of the proposed method. The authors also acknowledge that domain negation fails, indicating that there are boundaries to the applicability of their approach.
The proposed method has significant implications for biodiversity monitoring and conservation efforts, allowing institutions to collaborate without compromising data privacy. This collaborative paradigm could enhance the development of multi-taxa classifiers, making it easier to monitor diverse ecosystems and respond to conservation needs. The findings could also inspire further research into task vector arithmetic in other domains of machine learning, potentially leading to more efficient and privacy-preserving model training techniques.
Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limitations: the low interpretability of rule-based approaches, the restriction to single-primary-light control in music-to-color-space methods, and the limited transferability of music-to-controlling-parameter frameworks. To address these gaps, we propose SeqLight, a hierarchical deep learning framework that maps music to multi-light Hue-Saturation-Value (HSV) space. Our approach first customizes SkipBART, an end-to-end single-primary-light generation model, to predict the full light color distribution for each frame, followed by hybrid Imitation Learning (IL) techniques to derive an effective decomposition strategy that distributes the global color distribution among individual lights. Notably, the light decomposition module can be trained under varying venue-specific lighting configurations using only mixed light data and no professional demonstrations, thereby flexibly adapting across diverse venues. In this stage, we formulate the light decomposition task as a Goal-Conditioned Markov Decision Process (GCMDP), construct an expert demonstration set inspired by Hindsight Experience Replay (HER), and introduce a three-phase IL training pipeline, achieving strong generalization capability. To validate our IL solution for the proposed GCMDP, we conduct a series of quantitative analyses and a human study. The code and trained models are provided at https://github.com/RS2002/SeqLight.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong Polytechnic University, The University of Hong Kong, City University of Hong Kong
The main contribution of this paper is the introduction of SeqLight, a novel hierarchical framework for automatic stage lighting control that effectively combines music analysis with advanced machine learning techniques to enhance the quality and adaptability of lighting in live performances. This work represents a significant advancement in the intersection of machine learning and performing arts, addressing practical challenges while providing a robust methodological framework.
The paper introduces SeqLight, a hierarchical deep learning framework that innovatively combines a modified Skip-BART model with imitation learning techniques to address the challenges of automatic stage lighting control. The methodology is well-structured, separating the problem into two stages: predicting light distributions and decomposing these into individual light controls via a Goal-Conditioned Markov Decision Process (GCMDP). The use of hybrid imitation learning, particularly the incorporation of Hindsight Experience Replay (HER) and Adversarial Inverse Reinforcement Learning (AIRL), is a notable strength, as it allows for effective learning without the need for extensive expert demonstrations. The approach is comprehensive, addressing both the technical challenges of multi-light control and the practical limitations of data collection in diverse venues.
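The hindsight construction of "expert" goal-action pairs can be sketched as below, assuming a toy additive mixing model and per-light HSV actions; the actual SeqLight demonstration set and mixing model are more involved.

```python
# HER-style demonstration construction for goal-conditioned light decomposition:
# the mixed color actually produced by a sampled per-light assignment is
# relabeled as the goal, turning (goal, action) into a valid expert pair
# without any professional demonstrations. mix() is a simplified assumption.
import numpy as np

def mix(per_light_hsv):
    """Toy mixing model: the global color is the mean of the individual lights."""
    return np.mean(per_light_hsv, axis=0)

def her_demonstrations(num_samples, num_lights, rng=np.random.default_rng(0)):
    demos = []
    for _ in range(num_samples):
        action = rng.uniform(0.0, 1.0, size=(num_lights, 3))  # per-light HSV
        achieved_goal = mix(action)                            # resulting mixed color
        # Hindsight relabeling: treat the achieved mix as the desired goal.
        demos.append({"goal": achieved_goal, "action": action})
    return demos
```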
The experiments are robust, utilizing both quantitative metrics (L1 distance, Wasserstein distance, etc.) and qualitative assessments through human evaluations. The results demonstrate that SeqLight outperforms competitive baselines, including Skip-BART and rule-based methods, in various music styles. The inclusion of a human study adds significant value, providing insights into user preferences and the system's generalization capabilities across different music genres. However, the paper could benefit from clearer presentation of experimental setups and results, particularly in terms of dataset descriptions and evaluation metrics.
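As a reference for the distributional metrics mentioned above, a minimal sketch, assuming 1-D empirical samples of equal size (the paper's exact formulation may differ), is:

```python
# L1 distance over aligned values and 1-D Wasserstein-1 distance from sorted samples.
import numpy as np

def l1_distance(pred, target):
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(target, float))))

def wasserstein_1d(samples_a, samples_b):
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    assert a.shape == b.shape, "equal-size empirical samples assumed"
    # For equal-size samples, W1 equals the mean absolute difference of sorted values.
    return float(np.mean(np.abs(a - b)))
```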
The authors provide a GitHub repository with code and trained models, which is a positive aspect for reproducibility. However, the paper lacks detailed descriptions of the datasets used and the specific configurations for training, which could hinder full reproducibility. More explicit guidelines on running the experiments and the environment setup would enhance this aspect.
The paper acknowledges the limitations of its approach, particularly regarding the reliance on mixed light data without professional demonstrations. While the method shows promise in diverse venues, the generalization to real-world scenarios may still face challenges due to the variability in lighting setups and music styles. Additionally, the performance in out-of-domain settings, while improved, still shows some degradation compared to in-domain results.
The proposed method has significant implications for the fields of live music performance and event production, potentially reducing the need for professional lighting engineers and making stage lighting more accessible to amateurs. The ability to adapt to various venues and music styles opens up new avenues for creative expression in live performances. Furthermore, the integration of machine learning in artistic domains like lighting design could inspire further research and applications in related fields.
Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
Primary: VNU University of Engineering and Technology
All Institutions: VNU University of Engineering and Technology, Artificial Intelligence Research Center, VNU Information Technology Institute
The main contribution of this paper is the introduction of the OW-SED paradigm and the WOOT framework, which significantly advances the field of sound event detection by enabling models to detect known events, identify unseen ones, and incrementally learn from them. The methodology and experimental results demonstrate a strong technical contribution that addresses real-world challenges in audio understanding.
The paper introduces a novel Open-World Sound Event Detection (OW-SED) paradigm, which is a significant shift from traditional closed-world approaches. The proposed 1D Deformable architecture leverages deformable attention mechanisms to focus on salient temporal regions, addressing the unique challenges posed by overlapping and ambiguous sound events. The introduction of the WOOT framework, which incorporates feature disentanglement and a two-stage training strategy, is innovative and effectively enhances the model's ability to generalize to unseen classes while mitigating catastrophic forgetting. The methodology is well-structured, with clear explanations of the architecture and training processes.
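The temporal deformable-attention idea can be sketched in a single-head form as below: each query predicts a few sampling offsets around a reference position and gathers memory features at those (possibly fractional) positions by linear interpolation. The dimensions, the single head, and the offset parameterization are simplifying assumptions relative to the WOOT architecture.

```python
# Single-head 1D deformable attention over a temporal feature sequence (sketch).
import torch
import torch.nn as nn

class Deformable1DAttention(nn.Module):
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points)   # per-query temporal offsets
        self.weights = nn.Linear(dim, n_points)   # per-point attention logits
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, memory, ref_pos):
        # query: (B, Q, dim), memory: (B, T, dim), ref_pos: (B, Q) floats in [0, T-1]
        B, T, _ = memory.shape
        value = self.value_proj(memory)
        pos = (ref_pos.unsqueeze(-1) + self.offsets(query)).clamp(0, T - 1)  # (B, Q, P)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=T - 1)
        frac = (pos - lo.float()).unsqueeze(-1)                               # (B, Q, P, 1)
        gather = lambda idx: torch.gather(
            value.unsqueeze(1).expand(B, pos.size(1), T, value.size(-1)),
            2, idx.unsqueeze(-1).expand(-1, -1, -1, value.size(-1)))
        # Linear interpolation between the two nearest frames at each sampled point.
        sampled = (1 - frac) * gather(lo) + frac * gather(hi)                 # (B, Q, P, dim)
        attn = self.weights(query).softmax(dim=-1).unsqueeze(-1)              # (B, Q, P, 1)
        return self.out_proj((attn * sampled).sum(dim=2))                     # (B, Q, dim)
```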
The experimental results demonstrate that the proposed methods outperform existing state-of-the-art techniques in both closed-world and open-world settings. The authors provide comprehensive evaluations using two datasets (URBAN-SED and DESED), showcasing significant improvements in performance metrics such as U-Recall and F1 scores. The experiments are robust, with multiple random seeds used to ensure reliability, and the results are well-presented in comparative tables.
The paper includes detailed implementation details, including the architecture, training protocols, and hyperparameters. However, the absence of a public code repository or demo URL limits the reproducibility of the results. The authors should consider making their code available to facilitate further research and validation of their findings.
One limitation is the reliance on human annotation for labeling unknown events, which may introduce subjectivity and variability in the training process. Additionally, while the model shows strong performance, the paper does not extensively discuss the computational efficiency or scalability of the proposed framework in real-world applications.
The OW-SED paradigm has significant implications for various applications, including surveillance, smart cities, and healthcare, where the ability to detect and learn from novel sound events in dynamic environments is crucial. This work paves the way for more adaptive audio understanding systems that can continuously evolve with their environments.
MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker-Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.
Primary: Independent Researcher
All Institutions: Independent Researcher
The main contribution of this paper is the introduction of MiniMind-O, a small-scale omni model that effectively integrates text, speech, and image modalities while maintaining a focus on reproducibility and parameter efficiency. This work represents a meaningful step towards creating accessible and controllable multimodal systems, with implications for various applications in machine learning and human-computer interaction.
The methodology employed in MiniMind-O is innovative, leveraging a compact architecture that integrates text, speech, and image modalities within a 0.1B parameter framework. The separation of the Thinker and Talker components allows for a more efficient processing pipeline, while the use of middle-layer semantic bridging and low-rank codebook interfaces enhances parameter efficiency. The decision to release the training datasets and model code further promotes transparency and reproducibility in research.
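The projector-and-placeholder pattern described here can be sketched as follows; the placeholder id, projector shape, and dimensions are illustrative assumptions rather than the released MiniMind-O code.

```python
# Frozen-encoder features mapped by a small MLP into the LM hidden size and
# written into the token-embedding sequence at modality-placeholder positions.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, enc_dim=512, lm_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(enc_dim, lm_dim), nn.GELU(),
                                 nn.Linear(lm_dim, lm_dim))

    def forward(self, features):          # (num_frames, enc_dim)
        return self.mlp(features)         # (num_frames, lm_dim)

def inject_modality(token_embeds, input_ids, projected, placeholder_id):
    """Replace placeholder-token embeddings with projected encoder features.

    token_embeds: (seq_len, lm_dim) text embeddings for one sample
    input_ids:    (seq_len,) token ids containing `placeholder_id` slots
    projected:    (num_frames, lm_dim) output of ModalityProjector
    """
    slots = (input_ids == placeholder_id).nonzero(as_tuple=True)[0]
    assert len(slots) == projected.size(0), "one placeholder per encoder frame"
    token_embeds = token_embeds.clone()
    token_embeds[slots] = projected
    return token_embeds
```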
The experimental evaluation is robust, with clear metrics for assessing consistency between the Thinker and Talker outputs. The use of Character Error Rate (CER) and voice-cloning similarity scores provides a quantitative basis for evaluating performance. However, the evaluation metrics focus primarily on consistency rather than subjective quality measures, which could limit the understanding of the model's performance in real-world applications.
The paper emphasizes reproducibility by providing detailed descriptions of the model architecture, training pipeline, and the datasets used. The release of both the model code and training datasets is a significant step towards enabling other researchers to replicate the results. However, the lack of a specific institution may raise questions about the long-term support and maintenance of the project.
The main limitations include the model's performance in generating natural-sounding speech, especially for longer responses, which may not match the quality of larger models. Additionally, the visual pathway relies on a frozen encoder, which may not capture the full complexity of visual inputs. The narrow evaluation focus on consistency may overlook other important aspects of model performance, such as user experience and adaptability.
The potential applications of MiniMind-O are significant, particularly in areas requiring multimodal interaction, such as virtual assistants, educational tools, and accessibility technologies. By providing an open-source framework, the work encourages further research and development in the field of speech-native omni models, potentially leading to advancements in human-computer interaction.