Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer: joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
Primary: Tianjin University
All Institutions: Tianjin University, Kuaishou Technology, Institute of Automation, Chinese Academy of Sciences
UniSonate presents a unified framework for audio generation that synthesizes speech, music, and sound effects through a novel natural language interface. The technical contributions, including dynamic token injection and a multi-stage curriculum learning strategy, significantly advance the field of generative audio modeling, offering a comprehensive solution to the challenges of multimodal audio synthesis.
The methodology proposed in UniSonate is innovative, introducing a unified flow-matching framework that integrates speech, music, and sound effect generation through a natural language interface. The dynamic token injection mechanism is particularly noteworthy as it allows unstructured sound effects to be processed in a structured manner, enabling precise control over audio generation. This is complemented by a multi-stage curriculum learning strategy that effectively mitigates optimization conflicts, showcasing a thoughtful approach to training across diverse audio modalities.
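The flow-matching objective underlying this family of models can be sketched generically. The snippet below shows the standard rectified-flow-style training pair (interpolated sample plus target velocity); it is a minimal illustration of conditional flow matching in general, not UniSonate's actual architecture, conditioning, or latent space.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear-interpolation path from prior sample x0 to data sample x1.

    A velocity network v_theta(x_t, t) would be regressed onto v_target
    with a simple mean-squared-error loss.
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    v_target = x1 - x0              # constant target velocity along it
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)           # Gaussian prior sample ("noise")
x1 = np.array([1.0, 2.0, 3.0, 4.0])   # a "clean" latent sample
x_t, v = flow_matching_pair(x0, x1, t=0.5)
```

Because the learned velocity field is (approximately) straight, sampling reduces to a few Euler steps from the prior toward the data distribution, which is what makes flow-matching backbones attractive for fast audio synthesis.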
The experimental evaluation is robust, with extensive comparisons against state-of-the-art models in TTS, TTM, and TTA. The paper presents clear metrics for performance evaluation, including WER, SongEval scores, and subjective evaluations like MOS. The results indicate that UniSonate achieves state-of-the-art performance in TTS and TTM while maintaining competitive fidelity in TTA, demonstrating the effectiveness of the proposed methods.
The paper provides a comprehensive description of the model architecture, training procedures, and datasets used, which supports reproducibility. However, the lack of a public code repository may hinder independent verification of results. The authors do mention the use of specific hardware configurations and hyperparameters, which aids in understanding the implementation details.
The paper acknowledges limitations, particularly in the sound effect generation where performance lags behind specialized models. Additionally, challenges in generating long-form audio content and the inherent ambiguity in natural language instructions are highlighted. These limitations suggest areas for future research and improvement.
The potential applications of UniSonate are significant, as it paves the way for general-purpose audio generation systems that can synthesize complex auditory scenes. However, ethical considerations regarding the misuse of generated audio, biases in training data, and copyright issues in music generation are critical and warrant careful attention.
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe biases toward Western staff notation and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline--grounded in canonical pitch projection--to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, China Conservatory of Music, Nanyang Technological University
The paper introduces ONOTE, a comprehensive benchmark for evaluating Omnimodal Notation Processing, which addresses critical gaps in the assessment of music intelligence systems. The methodology and results presented are significant contributions to the field, paving the way for more effective and interpretable models in music AI.
The proposed ONOTE benchmark introduces a structured and deterministic evaluation framework for Omnimodal Notation Processing (ONP), addressing the limitations of existing models that often rely on subjective evaluations. The methodology effectively integrates multiple notation systems and tasks, ensuring a comprehensive assessment of model capabilities across auditory, visual, and symbolic domains. The use of canonical pitch projection and sequence alignment to eliminate biases is particularly innovative, allowing for a more rigorous comparison of model performance.
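One plausible reading of "canonical pitch projection" is mapping spelled pitches onto a canonical numeric space so that enharmonic spellings compare equal across notation systems. The note-name format and the helper below are illustrative assumptions for that idea, not ONOTE's actual pipeline.

```python
# Semitone offset of each note letter within an octave (C as reference).
_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def to_midi(note: str) -> int:
    """Project a spelled pitch such as 'C#4' or 'Db4' onto its MIDI number,
    so enharmonically equivalent spellings become identical for scoring."""
    letter, rest = note[0], note[1:]
    accidental = 0
    while rest and rest[0] in "#b":
        accidental += 1 if rest[0] == "#" else -1
        rest = rest[1:]
    # MIDI convention: C4 (middle C) = 60, so octave n starts at 12*(n+1).
    return 12 * (int(rest) + 1) + _SEMITONE[letter] + accidental
```

Under this projection, `to_midi("C#4")` and `to_midi("Db4")` both yield 61, so a deterministic sequence aligner scores them as a match regardless of how a given notation system spells the pitch.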
The experiments conducted on leading omnimodal models reveal significant insights into their performance across various tasks, including Visual Score Understanding (VSU), Cross-Format Notation Conversion (CNC), Audio-to-Symbolic Transcription (AST), and Symbolic Music Generation (SMG). The results highlight a clear disconnect between perceptual accuracy and music-theoretic comprehension, underscoring the benchmark's effectiveness in diagnosing reasoning vulnerabilities. The dataset construction and evaluation metrics are well-defined, providing a robust foundation for future research.
The paper provides detailed implementation details and a clear methodology for constructing the ONOTE benchmark, which enhances reproducibility. The availability of the dataset and code on GitHub further supports the reproducibility of the results, allowing other researchers to validate and build upon the work.
While the benchmark addresses several critical issues in music notation processing, it may still be limited by the inherent biases present in the datasets used for training and evaluation. Additionally, the focus on specific notation systems may not fully encompass the diversity of global musical representations, potentially limiting the generalizability of the findings.
The ONOTE benchmark has the potential to significantly influence the field of music intelligence by providing a standardized evaluation framework that encourages the development of more robust and interpretable omnimodal systems. Its implications extend beyond academic research, potentially impacting music education, automated composition, and music analysis tools.
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these limitations to the lack of data, benchmarks, and modeling approaches tailored for long-form temporal awareness. To bridge this gap, we first construct LAT-Chronicle, a 1.2k hour long-form audio dataset with temporal annotations across real-world scenarios. We further develop LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes while covering three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption. Leveraging these resources, we propose LAT-Audio, formulating temporal awareness as a progressive global-to-local reasoning paradigm. A global timeline is first constructed as an aligned temporal-semantic context, and the Think-With-Audio Chain-of-Thought (TWA-CoT) is then introduced to perform iterative reasoning by incorporating local audio information via tool use. Experiments show that LAT-Audio surpasses existing models on long-form audio temporal awareness tasks and improves robustness to input duration. We release the dataset, benchmark, and model to facilitate future research at https://github.com/alanshaoTT/LAT-Audio-Repo.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Independent Researcher
The main contribution of this paper is the introduction of a novel framework and dataset for improving temporal awareness in long-form audio understanding, which significantly advances the state of the art in audio language models. The comprehensive methodology, robust experimental validation, and potential applications underscore its significance in the field of machine learning and audio processing.
The paper presents a comprehensive methodology that addresses the limitations of existing Large Audio Language Models (LALMs) in handling long-form audio. The authors construct a new dataset (LAT-Chronicle) and benchmark (LAT-Bench) specifically designed for Long-form Audio Temporal Awareness (LATA) tasks, which include Dense Audio Captioning, Temporal Audio Grounding, and Targeted Audio Captioning. The proposed LAT-Audio framework introduces a novel global-to-local reasoning paradigm and the Think-With-Audio Chain-of-Thought (TWA-CoT) approach, which iteratively refines audio understanding by leveraging local audio segments based on a constructed global timeline. This innovative approach is well-justified and effectively addresses the challenges posed by long-form audio inputs.
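Temporal Audio Grounding is conventionally scored with temporal intersection-over-union between predicted and reference spans; the sketch below shows that standard metric. The paper's exact evaluation protocol and thresholds are not reproduced here.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```

For a 30-minute input, predicting (610.0, 640.0) against a reference of (600.0, 630.0) scores 0.5; a few seconds of drift per event is what makes this metric increasingly punishing as duration grows, which is the failure mode the global-to-local paradigm targets.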
The experimental evaluation is robust, demonstrating the effectiveness of LAT-Audio against existing models across multiple tasks. The authors provide thorough comparisons with baseline models and conduct ablation studies to validate the importance of key components such as the global timeline and TWA-CoT. The results show significant improvements in performance metrics, indicating that the proposed methods enhance temporal awareness and robustness in long-form audio understanding. The inclusion of a diverse dataset and human-verified benchmarks adds credibility to the findings.
The paper includes detailed implementation details and a clear description of the training strategy, which enhances the reproducibility of the results. The authors provide access to the dataset, benchmark, and model through a GitHub repository, facilitating further research and validation of their findings by the community.
While the proposed framework shows promise, there are limitations, such as the computational overhead introduced by multi-turn reasoning and tool use, which may hinder real-time applications. Additionally, the focus on single-audio inputs limits the framework's applicability in more complex multimodal scenarios. Future work is needed to enhance efficiency and extend the framework to broader contexts.
The research has significant implications for various applications, including automated transcription, audio search engines, and multimedia content analysis. By improving long-form audio understanding, the work can enhance user experiences in domains such as education, entertainment, and accessibility for the hearing impaired. The open-source nature of the project encourages further innovation and exploration in the field of audio language processing.
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
Primary: National Taiwan University
All Institutions: National Taiwan University, NTU Artificial Intelligence Center of Research Excellence
The paper presents a critical analysis of the reliance on audio in audio-language models, challenging existing benchmarks and proposing a framework for better evaluation. The methodology and findings are significant, offering valuable insights for researchers and practitioners in the field of machine learning and audio understanding.
The paper introduces a novel diagnostic framework that assesses audio-language models (LALMs) based on two axes: text prior and audio reliance. This dual-axis approach allows for a nuanced understanding of how much of a model's performance can be attributed to textual cues versus actual audio processing. The methodology is well-structured, employing controlled settings to quantify the text prior and audio reliance, which is a significant advancement in evaluating LALMs. The use of multiple benchmarks and a variety of models strengthens the robustness of the findings.
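Given per-item correctness under full-audio and text-only conditions, both axes reduce to simple aggregates. The helpers below are a hypothetical restatement of the two-axis idea; the paper's controlled settings and scoring details are not reproduced.

```python
def text_prior_retention(score_full_audio, score_text_only):
    """Fraction of the full-audio benchmark score retained without audio:
    high retention means the benchmark is answerable from text priors."""
    return score_text_only / score_full_audio

def audio_reliant_items(correct_full, correct_text_only):
    """Indices answered correctly only when audio is present, a per-item
    proxy for genuine dependence on the acoustic signal."""
    return [i for i, (w, t) in enumerate(zip(correct_full, correct_text_only))
            if w and not t]
```

A retention of, say, `text_prior_retention(0.80, 0.52) == 0.65` falls inside the paper's reported 60-72% band and signals that most items never test auditory perception at all; only the audio-reliant subset does.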
The experiments are thorough, evaluating eight LALMs across three distinct benchmarks. The results indicate a substantial grounding gap, revealing that models can achieve high scores without audio input, which challenges the assumption of robust auditory understanding. The analysis of performance retention with partial audio is particularly insightful, providing a clear picture of how audio information is utilized by the models. However, the paper could benefit from more detailed statistical analysis to support its claims.
The paper provides a clear description of the experimental setup, including the models used and the evaluation protocols. However, it lacks specific URLs or repositories for code and data, which could hinder reproducibility. Including such resources would enhance the paper's impact and facilitate further research in this area.
One limitation is the reliance on existing benchmarks, which may not fully capture the complexities of audio understanding. Additionally, while the study identifies issues with current benchmarks, it does not propose new benchmarks or datasets, which could be a missed opportunity for advancing the field. The findings may also be limited by the specific models and benchmarks chosen for evaluation.
The findings have significant implications for the design of future audio-language benchmarks and the evaluation of LALMs. By highlighting the potential for models to rely on textual priors rather than genuine auditory understanding, the paper calls for a reevaluation of how auditory capabilities are assessed in machine learning. This could lead to more accurate and reliable evaluations, ultimately improving the development of models that genuinely understand audio.
Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are underrepresented in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.
Primary: Seoul National University
All Institutions: Seoul National University
This paper presents a significant advancement in automatic chord recognition through a novel segment-level sequence modeling approach, effectively addressing oversegmentation and data imbalance challenges. The methodology is well-structured, and the experimental results demonstrate substantial improvements, marking a meaningful contribution to the field of music information retrieval.
The paper introduces a novel segment-level sequence-to-sequence approach for automatic chord recognition (ACR), effectively addressing oversegmentation and data imbalance issues prevalent in traditional frame-level methods. The use of a Transformer encoder-decoder architecture is well-justified, and the introduction of two token representations (MERGE and SPLIT) demonstrates a thoughtful approach to chord modeling. The encoder pre-training method based on chord similarity is innovative and enhances the model's ability to generalize, particularly for complex chord types.
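The core of the segment-level reformulation, emitting a chord label only where it changes, can be illustrated by collapsing frame-wise labels into segments. This helper is a hypothetical sketch of that idea, not the paper's MERGE/SPLIT tokenization.

```python
def frames_to_segments(frame_labels):
    """Collapse per-frame chord labels into (chord, start_frame, end_frame)
    segments, so chord changes are only represented at segment boundaries."""
    segments = []
    for i, label in enumerate(frame_labels):
        if segments and segments[-1][0] == label:
            chord, start, _ = segments[-1]
            segments[-1] = (chord, start, i + 1)   # extend current segment
        else:
            segments.append((label, i, i + 1))     # chord change: open new one
    return segments
```

`frames_to_segments(["C", "C", "G", "G", "G", "Am"])` yields `[("C", 0, 2), ("G", 2, 5), ("Am", 5, 6)]`: three tokens instead of six frame predictions, which is what removes the oversegmentation pressure from a frame-by-frame decoder.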
The experiments are comprehensive, utilizing a well-defined dataset of 471 pop songs with manual annotations. The use of 5-fold cross-validation strengthens the reliability of the results. The reported improvements in both chord recognition and segmentation metrics, particularly for complex chords, are significant and demonstrate the effectiveness of the proposed methods. The ablation studies provide clear insights into the contributions of each component of the model.
The paper includes sufficient implementation details, such as data preprocessing, model architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The availability of the code repository enhances this aspect, allowing other researchers to replicate the results and build upon this work.
While the paper addresses several critical challenges in ACR, it does not discuss the potential limitations of the proposed methods, such as the reliance on the quality of the training dataset or the challenges in generalizing to genres or styles not represented in the dataset. Additionally, the model's performance on real-world recordings versus studio recordings could be explored further.
The advancements in chord recognition could have significant implications for music information retrieval, music education, and automated music composition systems. By improving the recognition of complex chords, this work could enhance tools for musicians and composers, making music analysis more accessible and efficient.
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing
The paper presents a significant advancement in automatic speech recognition by introducing an abstention-aware framework and a novel reliability metric, RAS, which enhances the reliability of ASR outputs in uncertain conditions. The methodology is well-founded and the experimental results robustly support the proposed contributions, marking a meaningful step forward in the field of speech processing.
The paper introduces a novel abstention-aware transcription framework for ASR systems, which allows models to abstain from uncertain segments rather than producing potentially misleading transcriptions. The proposed Reliability-Aware Score (RAS) metric is innovative as it integrates a placeholder for uncertainty directly into the transcription process, moving beyond traditional metrics like Word Error Rate (WER). The methodology is well-structured, employing a two-stage training pipeline that combines supervised bootstrapping and reinforcement learning, effectively enhancing the model's reliability in challenging acoustic conditions.
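The trade-off the RAS metric encodes can be sketched with a toy score that rewards transcribed correct words and penalizes errors more heavily than abstentions. Both the functional form and the weight `lam` below are illustrative assumptions; the paper calibrates its actual trade-off parameter from human preference data.

```python
def ras_like_score(n_correct, n_errors, n_abstained, lam=2.0):
    """Toy reliability score: +1 per correct word, -lam per erroneous word,
    0 per abstained word, normalized by total word count.
    lam > 1 encodes error aversion: a wrong word costs more than staying silent."""
    total = n_correct + n_errors + n_abstained
    if total == 0:
        return 0.0
    return (n_correct - lam * n_errors) / total
```

Under this toy score, abstaining on two uncertain words (`ras_like_score(8, 0, 2) == 0.8`) beats guessing them wrong (`ras_like_score(8, 2, 0) == 0.4`), which is exactly the behavior an abstention-aware decoder is trained toward; plain WER cannot distinguish the two strategies' reliability.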
The experiments are comprehensive, utilizing two datasets (LibriSpeech and TALCS) to evaluate the proposed method under both clean and noisy conditions. The results demonstrate significant improvements in transcription reliability, particularly in adverse environments, while maintaining competitive accuracy. The use of human preference alignment for calibrating the RAS metric adds robustness to the evaluation process, ensuring that the proposed framework is grounded in real-world applicability.
The paper provides detailed descriptions of the methodology, including the training pipeline and experimental setup. However, there is a lack of supplementary material or code repositories that would facilitate complete reproducibility. The absence of a project URL limits the ability for other researchers to replicate the findings directly.
While the proposed framework shows promise, the reliance on human preference data for calibrating the RAS metric may introduce biases based on the specific population sampled. Additionally, the performance in highly diverse acoustic environments beyond those tested (e.g., different languages or dialects) remains unaddressed, which could limit the generalizability of the findings.
The approach has significant implications for high-stakes applications of ASR, such as medical and legal transcription, where reliability is critical. By providing a mechanism for models to indicate uncertainty, the framework can enhance user trust and improve decision-making processes in various domains. The introduction of RAS as a new evaluation metric could also pave the way for further research into reliable ASR systems.
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN Audio A/S
The main contribution of this paper is the introduction of DriftSE, a novel generative framework for speech enhancement that reformulates denoising as an equilibrium problem, achieving high-fidelity results in a single inference step. This work represents a significant advancement in the field of speech enhancement, combining innovative methodology with robust experimental validation to address critical challenges in real-time applications.
The proposed method, DriftSE, innovatively formulates speech enhancement as an equilibrium problem, leveraging a learned Drifting Field for one-step inference. This approach diverges from traditional iterative sampling techniques, providing a significant computational advantage. The use of a semantic latent space for drift computation enhances the model's ability to capture complex speech structures, which is a notable improvement over existing methods. The dual formulation of the model—direct mapping and conditional generation—adds flexibility and robustness to the framework, allowing it to adapt to various scenarios, including unpaired training.
The experiments conducted on the VoiceBank-DEMAND benchmark and the DNS Challenge 2020 blind test set showcase the effectiveness of DriftSE in achieving high-fidelity speech enhancement. The reported metrics (PESQ, SI-SDR, SCOREQ) indicate that DriftSE outperforms both multi-step diffusion models and other one-step approaches, establishing its competitive edge. The thorough evaluation across different datasets and conditions demonstrates the model's generalization capabilities, which is crucial for real-world applications.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and hyperparameter settings, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the accessibility of the method for further validation by the research community.
While the DriftSE framework shows promising results, its reliance on a pre-trained self-supervised learning encoder may introduce limitations related to the quality and representativeness of the latent features. Additionally, the performance drop in unpaired settings suggests that the model may struggle in scenarios where clean-reference data is not available, highlighting a potential area for improvement.
The DriftSE framework has significant implications for real-time speech enhancement applications, particularly in environments with varying noise conditions. Its ability to perform one-step inference could facilitate deployment in low-latency scenarios, such as telecommunication and assistive technologies. Furthermore, the methodology could inspire future research in generative modeling and distribution matching across other domains beyond audio.
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences, a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framework that decomposes this complex task through specialized multi-agent orchestration. Our framework employs three key innovations: (1) a multi-agent narrative synthesis module where specialized LLM agents collaboratively generate comprehensive cinematic blueprints with character profiles, scene descriptions, and cross-modal specifications; (2) a decoupled character-centric pipeline that maintains identity consistency through instance-level tracking and integration while enabling flexible multi-character composition; and (3) a hierarchical audio-visual synchronization mechanism ensuring frame-level alignment of dialogue, expressions, and music. Extensive experiments demonstrate that CineAGI achieves 40% improvement in overall consistency, 4.4% gain in subject consistency, 5.4% enhancement in aesthetic quality, and 28.7% higher character consistency compared to baselines. Our work establishes a principled foundation for automated multi-scene video generation that preserves narrative coherence and character authenticity.
Primary: Nanjing University
All Institutions: Nanjing University, Zhejiang Sci-Tech University, University of British Columbia, Beijing Shuzhimei Technology Co., Ltd, Jilin University, Tianjin University
CineAGI represents a significant advancement in automated movie creation through its innovative multi-agent orchestration framework. The comprehensive methodology and substantial experimental validation establish it as a leading approach in the field, with the potential to reshape how narratives are crafted in digital media.
The methodology presented in CineAGI is robust and innovative, leveraging a hierarchical multi-agent orchestration approach to tackle the complex task of automated movie creation. The use of specialized LLM agents for narrative synthesis, character generation, and cinematographic synthesis is a significant advancement over traditional end-to-end models. The framework's ability to maintain character consistency and narrative coherence across scenes through decoupled processing and explicit synchronization mechanisms is particularly noteworthy. The detailed breakdown of each module and the integration of various generative models demonstrate a comprehensive understanding of the challenges in automated filmmaking.
The experimental evaluation is thorough, utilizing a diverse benchmark of 100 story prompts across multiple genres to assess the framework's performance. The use of both quantitative metrics and qualitative human evaluations provides a well-rounded perspective on the system's effectiveness. The reported improvements in consistency and aesthetic quality are substantial, indicating that the proposed methods yield significant enhancements over existing baselines. However, the paper could benefit from more detailed comparisons with a wider range of contemporary methods to further contextualize its contributions.
The paper provides a detailed description of the experimental setup, including generation settings, evaluation metrics, and baseline comparisons. However, the lack of publicly available code or demo URLs limits reproducibility. Future work should consider releasing the implementation to facilitate further research and validation by the community.
One limitation of the study is the reliance on specific generative models, which may not generalize across all contexts or genres of filmmaking. Additionally, while the framework shows improvements in character consistency and narrative coherence, the complexity of the system may introduce challenges in real-time applications or scalability. The computational cost of approximately 11.3 minutes per scene on a single GPU could also be a barrier for broader adoption.
The implications of CineAGI extend beyond academic research into practical applications in the film and entertainment industry. By automating aspects of movie creation, this framework could democratize content production, enabling creators with limited resources to produce high-quality narratives. Furthermore, the integration of AI in creative processes raises questions about authorship and the role of human creativity in storytelling.
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective remedy that directly removes redundant tokens from the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and compute token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, and that these heads behave very differently on semantic versus acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining only 70% of the audio tokens and reaching 101.8% and 103.0% of the vanilla average scores on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, DAIL Tech, Northeastern University, Sichuan University, Huazhong University of Science and Technology
The main contribution of this paper is the introduction of HeadRouter, a dynamic head-weight routing mechanism for audio token pruning in large audio language models, which significantly enhances performance and efficiency in processing diverse audio tasks. This work represents a meaningful advancement in the field of audio language models, addressing critical challenges in token management and model efficiency while maintaining high performance across various audio tasks.
The proposed HeadRouter method introduces a novel dynamic head-weight routing mechanism that adapts to the varying importance of attention heads in large audio language models (LALMs). This approach is innovative in its use of entropy-based selectivity scores and Gaussian soft mixing to create task-specific head-weight profiles. The training-free nature of the method allows it to be easily integrated into existing models without additional training overhead, which is a significant advantage for practical applications.
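The head-weighting idea described above can be sketched in a few lines of numpy. Note this is a minimal illustration, not HeadRouter's implementation: plain attention entropy stands in for its entropy-based selectivity scores, and the Gaussian soft-mixing step is omitted.

```python
import numpy as np

def head_selectivity(attn):
    """Per-head selectivity from attention entropy.

    attn: (heads, queries, keys) attention weights (each row sums to 1).
    Low-entropy heads attend sharply to a few tokens and are treated as
    more informative when scoring token importance.
    """
    eps = 1e-12
    ent = -(attn * np.log(attn + eps)).sum(-1).mean(-1)  # (heads,)
    score = -ent                                         # sharper => higher
    w = np.exp(score - score.max())
    return w / w.sum()                                   # softmax head weights

def prune_tokens(attn, keep_ratio=0.7):
    """Keep the top-k audio tokens ranked by head-weighted attention mass."""
    w = head_selectivity(attn)
    importance = np.einsum("h,hqk->k", w, attn)          # weighted column mass
    k = max(1, int(round(keep_ratio * attn.shape[-1])))
    return np.sort(np.argsort(importance)[-k:])          # kept token indices
```

Averaging across all heads, as prior methods do, corresponds to setting `w` uniform; the routing view replaces that uniform average with a task-dependent weighting.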
The experiments conducted on the AudioMarathon and MMAU-Pro benchmarks demonstrate the effectiveness of HeadRouter in outperforming existing token pruning methods across various audio tasks. The results indicate that the method not only maintains performance while aggressively pruning tokens but also adapts well to different audio contexts, showcasing its robustness. The comparative analysis with state-of-the-art methods further validates the proposed approach's superiority in managing token importance dynamically.
The paper provides a clear description of the methodology, including the routing mechanism and evaluation setup, which supports reproducibility. However, the lack of publicly available code or detailed implementation guidelines may hinder full reproducibility for other researchers.
One limitation is the reliance on pre-calibrated head-weight profiles, which may not generalize across all audio tasks or models. Additionally, while the method shows promise in reducing computational costs, the paper does not explore the implications of using HeadRouter in real-time applications or its impact on latency in practical deployments.
The implications of this research extend to various applications in audio processing, including speech recognition, music analysis, and multimodal systems. By improving the efficiency of LALMs, this work could facilitate more widespread adoption of advanced audio understanding technologies in real-time applications, enhancing user experiences in voice-interactive systems.
Machine generation of symbolic music and digital audio is a popular research topic, yet relatively few digital musical instruments integrate generative AI. Present musical AI tools are not artist-centred and do not support experimentation or integration into musical instruments and practices. This work introduces an inexpensive generative AI instrument platform based on a single-board computer that connects via MIDI to other musical devices. The platform uses artist-collected datasets, with models trained on a regular computer. This paper asks what the design space of intelligent musical instruments might look like when accessible, portable AI systems are available for artistic exploration. I contribute five examples of instruments created and tested through a two-year first-person artistic research process. These show that (re)mapping can replace retraining for discovering new AI interactions, that fast input interleaving is a new co-creative strategy, that small-data AI models can be a transportable design resource, and that cheap hardware can lower barriers to inclusion. This work could enable artists to explore new interaction and performance schemes with intelligent musical instruments.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a novel generative AI platform for intelligent musical instruments, emphasizing artist-centered design and small-data approaches. The comprehensive exploration of performance experiences and instrument development contributes valuable insights to the intersection of AI and music, highlighting the potential for innovative co-creative practices.
The methodology is grounded in a first-person artistic research approach, which is innovative in the context of generative AI in music. The use of small-data AI models trained on artist-collected datasets is a significant contribution, allowing for a more personalized and artist-centered exploration of generative AI in musical contexts. The paper effectively outlines the design and implementation of a generative AI platform that integrates with existing musical instruments, showcasing a practical application of AI in music performance. The iterative development of five distinct instruments provides a rich qualitative dataset for analysis.
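The "remapping instead of retraining" finding can be illustrated with a toy sketch: the model's discrete outputs pass through a swappable mapping table onto MIDI notes, so changing the instrument's musical behaviour means editing the map rather than retraining. The scale table and function below are hypothetical, not taken from the paper's implementation.

```python
# Illustrative only: a swappable output-to-MIDI mapping. Changing
# MAJOR_PENTATONIC to another scale changes the instrument's behaviour
# without touching the trained model.
MAJOR_PENTATONIC = [0, 2, 4, 7, 9]

def remap(token_ids, scale=MAJOR_PENTATONIC, root=60):
    """Map raw model token ids onto MIDI note numbers in a chosen scale."""
    notes = []
    for t in token_ids:
        degree, octave = t % len(scale), t // len(scale)
        notes.append(root + 12 * octave + scale[degree])
    return notes
```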
The experiments conducted over two years of performance practice are well-documented, providing insights into the evolution of the instruments and their interactions with musicians. The author details the performance experiences and the adaptability of the instruments in various contexts, which adds depth to the evaluation. However, the paper lacks quantitative metrics for assessing the performance of the AI models, which could strengthen the evaluation of their effectiveness.
The implementation details are provided, including the use of Raspberry Pi and the open-source nature of the software, which enhances reproducibility. The availability of the project on GitHub allows others to replicate the setup and experiment with the platform. However, more detailed instructions on the configuration and training processes would further aid reproducibility.
The study is limited by its first-person perspective, which may not capture the full range of experiences from diverse musicians. Additionally, the exploration of model updates over time is not systematically addressed, which could provide further insights into the adaptability and longevity of the AI models in performance settings.
This work has the potential to democratize access to intelligent musical instruments by lowering the cost barrier and encouraging experimentation among artists. The findings could influence future designs of musical AI systems, promoting a shift towards artist-centered approaches in generative AI applications. The implications for HCI and music technology communities are significant, as the research opens new avenues for interaction and collaboration between humans and AI in creative practices.
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that enforces models to learn platform-invariant semantic structural representations. In this paper, the RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set further includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross-platform generalization and noise robustness, offering an effective and generalizable modeling paradigm. The RTCFake dataset is available at https://huggingface.co/datasets/JunXueTech/RTCFake.
Primary: unknown
All Institutions: unknown
The paper presents RTCFake, a novel dataset and a phoneme-guided consistency learning strategy for detecting speech deepfakes in real-time communication, addressing a critical gap in existing research. The methodology is innovative, and the experimental results demonstrate substantial improvements, making it a valuable contribution to the field of audio and speech processing.
The paper introduces a phoneme-guided consistency learning (PCL) strategy, which is a novel approach aimed at enhancing the robustness of speech deepfake detection in real-time communication scenarios. The proposed methodology effectively addresses the challenges posed by various distortions and codec compressions encountered in RTC environments. The dataset, RTCFake, is a significant contribution, as it is specifically designed for the complexities of real-time communication, which is often overlooked in existing literature.
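Since the abstract does not spell out the PCL objective, the sketch below shows one plausible form of phoneme-guided consistency: pairing offline and RTC-transmitted versions of the same utterance and penalizing embedding drift within each phoneme segment. The cosine formulation is an assumption for illustration.

```python
import numpy as np

def consistency_loss(emb_offline, emb_online, phoneme_ids):
    """Mean cosine distance between phoneme-averaged embeddings of the
    same utterance before and after RTC transmission.

    emb_*: (frames, dim) frame embeddings of the paired recordings.
    phoneme_ids: (frames,) phoneme label per frame, shared by the pair
    thanks to the dataset's precise offline/online alignment.
    """
    loss, labels = 0.0, np.unique(phoneme_ids)
    for p in labels:
        m = phoneme_ids == p
        a, b = emb_offline[m].mean(0), emb_online[m].mean(0)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        loss += 1.0 - cos
    return loss / len(labels)
```

Minimizing such a loss pushes the encoder toward representations that depend on the phonetic content rather than on platform-specific channel distortions.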
The authors provide a comprehensive evaluation of their proposed method using a large-scale dataset of approximately 600 hours of speech. The evaluation set includes both unseen RTC platforms and complex noise conditions, which enhances the realism of the testing environment. The reported improvements in cross-platform generalization and noise robustness are significant, indicating that the proposed method is effective in practical applications.
While the paper mentions the availability of the RTCFake dataset on Hugging Face, it lacks detailed implementation specifics regarding the PCL strategy and the models used. This omission could hinder reproducibility, as other researchers may struggle to replicate the results without clear guidance on the experimental setup.
One limitation is that the dataset may not encompass all possible real-time communication scenarios, potentially limiting the generalizability of the findings. Additionally, the paper does not address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant, as it addresses a pressing issue in the age of deepfake technology. The ability to detect speech deepfakes in real-time communication can have far-reaching effects on security, privacy, and trust in digital communications. The proposed dataset and methodology could serve as a foundation for future research in this area.
Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting a suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method struggles to track the direction variations of non-stationary noise, such as that from a moving source. This work therefore proposes a Predictive Directional SFANC (PD-SFANC) method that uses a Convolutional Recurrent Neural Network (CRNN) to capture the hidden temporal dynamics of the moving noise and predict the control filter needed to cancel future noise. Accordingly, the proposed method significantly improves its noise-tracking ability and dynamic noise-reduction performance. Numerical simulations confirm the superiority of the proposed method in handling moving sources across various movement scenarios, compared to several representative ANC baselines.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Northwestern Polytechnical University
The main contribution of this paper is the introduction of a novel PD-SFANC method that leverages CRNNs for proactive noise control in dynamic environments. This work significantly advances the field of active noise control by addressing the challenges of tracking moving noise sources, offering a promising solution that could enhance the performance of ANC systems in real-world applications.
The proposed Predictive Directional SFANC (PD-SFANC) method effectively integrates a Convolutional Recurrent Neural Network (CRNN) for predicting the Direction-of-Arrival (DoA) of moving noise sources. The methodology is well-structured, utilizing a pre-trained control filter library and a dual-module architecture that separates the predictive and real-time noise control processes. This design addresses the limitations of existing methods, particularly the lag in filter adaptation for moving sources, showcasing a significant advancement in active noise control systems.
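The predict-then-select loop can be sketched as follows. Linear extrapolation of the DoA trajectory stands in for the CRNN predictor (which is the paper's actual contribution), and the 30-degree filter grid is an assumed example of a pre-trained control filter library.

```python
import numpy as np

# Assumed example library: one pre-trained control filter per 30-degree DoA bin.
FILTER_DOAS = np.arange(0, 181, 30)

def predict_doa(history):
    """Stand-in predictor: linear extrapolation of the DoA trajectory.
    (PD-SFANC uses a CRNN here; extrapolation only illustrates its role.)"""
    history = np.asarray(history, dtype=float)
    if len(history) < 2:
        return float(history[-1])
    return float(history[-1] + (history[-1] - history[-2]))

def select_filter(history):
    """Pick the pre-trained filter whose DoA is closest to the predicted one,
    so the controller is already in place when the source arrives there."""
    doa = np.clip(predict_doa(history), FILTER_DOAS[0], FILTER_DOAS[-1])
    return int(np.argmin(np.abs(FILTER_DOAS - doa)))
```

The point of predicting rather than reacting is that filter switching happens before the source reaches the new direction, removing the adaptation lag of reactive D-SFANC.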
The experiments are comprehensive, utilizing numerical simulations to evaluate the performance of PD-SFANC against established baseline methods. The authors provide detailed descriptions of the simulation setup, including the dataset construction and the noise scenarios tested. The results demonstrate that PD-SFANC outperforms traditional methods in various movement scenarios, with robust noise reduction performance and accurate DoA predictions, reinforcing the effectiveness of the proposed approach.
The paper mentions that the code will be available on GitHub, which is a positive aspect for reproducibility. However, specific implementation details, such as hyperparameters and training settings, could be more explicitly stated to facilitate easier replication of the results by other researchers.
One limitation is that the proposed method is designed for single-source scenarios, which may restrict its applicability in environments with multiple overlapping noise sources. Additionally, while the CRNN shows strong performance, its reliance on a pre-trained filter library may limit adaptability to entirely new noise types not represented in the training data.
The implications of this research extend to various fields where noise control is critical, such as automotive, aviation, and consumer electronics. The ability to effectively manage noise from moving sources can enhance user experience in products like headphones, smart devices, and automotive noise cancellation systems, potentially leading to broader adoption of advanced ANC technologies.
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, thereby preserving a higher degree of naturalness that makes imitation-based speech forgery significantly more challenging to detect using conventional acoustic or cepstral features. To overcome this challenge, this study proposes an auditory perception-based Spectro-Temporal Modulation (STM) representation framework for human-imitated speech detection. The STM representations are derived from two cochlear filterbank models: the Gammatone Filterbank (GTFB), which simulates frequency selectivity and can be regarded as a first approximation of cochlear filtering, and the Gammachirp Filterbank (GCFB), which further models both frequency selectivity and level-dependent asymmetry. These STM representations jointly capture temporal and spectral fluctuations in speech signals, corresponding to changes over time in the spectrogram and variations along the frequency axis related to human auditory perception. We also introduce a Segmental-STM representation to analyze short-term modulation patterns across overlapping time windows, enabling high-resolution modeling of temporal speech variations. Experimental results show that STM representations are effective for human-imitated speech detection, achieving accuracy levels close to those of human listeners. In addition, Segmental-STM representations are more effective, surpassing human perceptual performance. The findings demonstrate that perceptually inspired spectro-temporal modeling is promising for detecting imitation-based speech attacks and improving voice authentication robustness.
Primary: Japan Advanced Institute of Science and Technology
All Institutions: Japan Advanced Institute of Science and Technology
The paper presents a comprehensive framework for detecting human-imitated speech through innovative auditory-inspired representations, addressing a critical gap in the field. The methodology is well-founded in auditory processing principles, and the experimental results demonstrate significant advancements in detection accuracy, highlighting the potential for real-world applications in voice authentication and security.
The paper introduces a novel Spectro-Temporal Modulation (STM) representation framework based on auditory perception, utilizing Gammatone and Gammachirp filterbanks to capture temporal and spectral fluctuations in human-imitated speech. The methodology is well-grounded in auditory processing principles, and the introduction of Segmental-STM representation enhances the modeling of short-term modulation patterns, which is a significant advancement over conventional acoustic features. The approach is innovative, addressing a critical gap in the detection of human-imitated speech, which has been underexplored in existing literature.
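A minimal numpy sketch of the two ingredients above: a standard 4th-order gammatone impulse response (the GTFB building block) and a spectro-temporal modulation spectrum taken as the 2-D FFT of a log-magnitude cochleagram. The GCFB's level-dependent gammachirp asymmetry and the paper's exact STM computation are not reproduced here.

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, duration=0.05):
    """4th-order gammatone impulse response at centre frequency fc (Hz).
    Bandwidth follows the standard ERB rule, b = 1.019 * ERB(fc)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)       # Glasberg-Moore ERB
    b = 1.019 * erb
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.abs(g).max()                     # peak-normalize

def modulation_spectrum(log_spec):
    """Spectro-temporal modulation: magnitude of the 2-D FFT of a
    log-magnitude cochleagram (channels x frames). One axis captures
    temporal modulation, the other spectral modulation."""
    return np.abs(np.fft.fftshift(np.fft.fft2(log_spec)))
```

Filtering speech through a bank of such impulse responses at ERB-spaced centre frequencies yields the cochleagram whose joint temporal and spectral fluctuations the STM representation summarizes.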
The experimental setup is robust, utilizing a dataset specifically designed for human-imitated speech detection. The results indicate that the proposed STM representations outperform traditional acoustic features, achieving accuracy levels comparable to human listeners. The inclusion of multiple classifiers (SVM, KNN, Extra Trees) strengthens the evaluation, and the performance metrics are clearly presented. However, the dataset size could be a limitation, as only 100 samples were used for testing, which may affect the generalizability of the findings.
The paper provides a detailed description of the methodology, including the computation of STM representations and the machine learning models used. However, the lack of a publicly available dataset or code repository limits reproducibility. Future work should consider sharing the dataset and implementation details to facilitate independent validation of results.
The primary limitation is the small dataset size, which may restrict the robustness of the findings and their applicability to broader contexts. Additionally, while the results are promising, the study does not address potential variations in performance across different languages or speaker characteristics, which could affect the generalizability of the approach.
The proposed framework has significant implications for voice authentication and security systems, particularly in contexts where human-imitated speech poses a threat. By improving detection capabilities, this work could enhance the security of voice-based systems, making them more resilient against imitation attacks. The findings also contribute to the understanding of auditory perception in speech processing, potentially influencing future research in related fields.
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems face inherent limitations: CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward the intended targets. To address these bottlenecks, we propose a prompt-free framework that decouples acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model enforcing monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the IF strategy under the knowledge-transfer principle. Experiments show that CROTTC-IF achieves a 71.77% F1-score on L2-ARCTIC and a 71.70% F1-score on the Iqra'Eval2 leaderboard. Through empirical analysis, we demonstrate that decoupling acoustics from explicit priors yields highly robust MDD.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
The main contribution of this paper is the introduction of a prompt-free paradigm for mispronunciation detection that effectively separates acoustic fidelity from canonical bias, leading to improved diagnostic accuracy. This work significantly advances the field of MDD by addressing critical methodological challenges and demonstrating state-of-the-art performance across diverse benchmarks, thus paving the way for future research and applications in language learning and speech recognition.
The paper introduces a novel framework, CROTTC-IF, which effectively decouples acoustic fidelity from canonical guidance in Mispronunciation Detection and Diagnosis (MDD). The methodology is well-structured, incorporating a frame-wise acoustic model (CROTTC) that utilizes Optimal Temporal Transport Classification (OTTC) to capture fine-grained mispronunciation cues. Additionally, the Indirect Fusion (IF) strategy allows for implicit knowledge transfer, enhancing the model's performance without relying on explicit canonical prompts. The integration of Consistency Regularization further stabilizes predictions, showcasing a comprehensive approach to addressing the limitations of existing MDD systems.
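To make the monotonic frame-level alignment constraint concrete, the sketch below computes a best non-decreasing frame-to-phoneme assignment with a simple Viterbi-style dynamic program. This illustrates only the alignment structure CROTTC enforces; the paper's actual OTTC objective and training procedure are not reproduced.

```python
import numpy as np

def monotonic_align(log_probs, phonemes):
    """Best monotonic frame-to-phoneme alignment via dynamic programming.

    log_probs: (T, vocab) frame-wise log posteriors.
    phonemes: canonical phoneme-id sequence of length S <= T.
    Returns a non-decreasing list assigning one phoneme index per frame.
    """
    T, S = len(log_probs), len(phonemes)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)   # 1 if we advanced to the next phoneme
    score[0, 0] = log_probs[0, phonemes[0]]
    for t in range(1, T):
        for s in range(min(t + 1, S)):   # cannot reach phoneme s before frame s
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = int(move > stay)
            score[t, s] = max(stay, move) + log_probs[t, phonemes[s]]
    path, s = [], S - 1                  # must finish on the last phoneme
    for t in range(T - 1, 0, -1):
        path.append(s)
        s -= back[t, s]
    path.append(s)
    return path[::-1]
```

Because every frame is scored against exactly one phoneme under a monotonicity constraint, transient frame-level deviations contribute directly to the alignment cost instead of being absorbed into a sequence-level decoding, which is the weakness attributed to CTC above.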
The experimental evaluation is robust, with the authors conducting extensive tests on multiple datasets, including L2-ARCTIC and Iqra'Eval2. The reported F1-scores of 71.77% and 71.70% demonstrate competitive performance compared to state-of-the-art methods. The paper includes ablation studies that effectively highlight the contributions of different components of the proposed framework, providing a clear understanding of the impact of each method on overall performance.
The paper provides detailed implementation details, including architecture specifications, training protocols, and hyperparameter settings. However, the lack of a publicly accessible code repository limits the reproducibility of the results, as external researchers cannot easily verify or build upon the findings.
While the proposed framework shows promise, the paper does not address potential limitations regarding the generalizability of the model to spontaneous speech or other languages beyond the tested datasets. Additionally, the reliance on specific datasets may introduce biases that could affect the model's applicability in diverse real-world scenarios.
The advancements in MDD presented in this paper have significant implications for various applications, particularly in language learning and automated speech recognition. By improving the accuracy of mispronunciation detection, the framework can enhance educational tools for language learners and contribute to more effective speech therapy solutions.
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Primary: Wuhan University
All Institutions: Wuhan University, Tencent Ethereal Audio Lab, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of DM-ASR, a diarization-aware multi-speaker ASR framework that effectively combines speaker attribution and temporal grounding through a structured dialogue generation approach. This innovative methodology not only improves transcription quality but also demonstrates the potential of integrating diarization cues with large language models, marking a significant advancement in the field of automatic speech recognition.
The proposed DM-ASR framework innovatively reformulates the multi-speaker ASR task as a multi-turn dialogue generation process, effectively integrating speaker diarization cues into the transcription process. This approach decouples speaker identity and temporal information from linguistic content, allowing for a structured generation that enhances both transcription accuracy and robustness against imperfect diarization cues. The introduction of special tokens for speaker and timestamp information, alongside the optional word-level timestamp prediction, represents a significant methodological advancement in the field.
The experiments conducted on both Mandarin and English datasets demonstrate the effectiveness of DM-ASR, achieving competitive performance with smaller models and limited training data compared to larger, more data-intensive systems. The results indicate that the framework not only outperforms traditional cascaded systems but also rivals state-of-the-art end-to-end models, showcasing the practical applicability and generalizability of the proposed method across different languages and conversational contexts.
The paper provides detailed implementation information, including the architecture of the model, training procedures, and datasets used, which enhances reproducibility. However, the lack of publicly available code or demo URLs limits the ability for others to directly replicate the findings without additional effort.
One notable limitation is the reliance on external diarization systems, which can introduce errors that affect overall performance. Additionally, while the model shows robustness against imperfect cues, it does not consistently outperform strong diarization front-ends under all conditions, indicating a potential area for improvement. The paper also does not explore the scalability of the method to larger datasets or more complex conversational scenarios.
The DM-ASR framework has significant implications for real-world applications in multi-speaker environments such as meetings, interviews, and call centers. By improving the accuracy of speaker attribution and temporal grounding in ASR systems, it could enhance accessibility for users requiring accurate transcriptions, such as those with hearing impairments. Furthermore, the integration of LLMs with diarization cues could pave the way for more advanced conversational AI systems capable of understanding and generating human-like dialogue.
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
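The beat-based pre-quantization step can be sketched as follows. This is a minimal illustration of the general idea — mapping performance onsets into beat-relative positions using a priori beat times, then snapping to a subdivision grid — not the paper's exact alignment procedure; the grid resolution and interpolation scheme are assumptions.

```python
import bisect

def to_beat_position(onset, beats):
    """Map a performance onset (seconds) to a fractional beat position by
    linear interpolation between adjacent annotated beat times."""
    i = bisect.bisect_right(beats, onset) - 1
    i = max(0, min(i, len(beats) - 2))          # clamp to valid beat interval
    span = beats[i + 1] - beats[i]
    return i + (onset - beats[i]) / span

def pre_quantize(onset, beats, grid=4):
    """Snap the beat position to the nearest 1/grid subdivision
    (grid=4 corresponds to sixteenth notes)."""
    return round(to_beat_position(onset, beats) * grid) / grid

beats = [0.0, 0.5, 1.0, 1.5]        # annotated beat times in seconds
print(pre_quantize(0.74, beats))    # onset at 0.74 s lands on beat 1.5
```

Aligning both performance and score times into this shared beat framework is what lets a sequence model learn the quantization mapping directly.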
Primary: Klangio GmbH
All Institutions: Klangio GmbH, Institute of Industrial Information Technology, Karlsruhe Institute of Technology
This paper presents a novel transformer-based approach for beat-based rhythm quantization of MIDI performances, significantly advancing the field of Automatic Music Transcription. The integration of beat annotations into the quantization process enhances the model's performance and flexibility, marking a meaningful contribution to music information retrieval.
The methodology is robust, leveraging a transformer architecture tailored for rhythm quantization by incorporating beat annotations. The preprocessing steps for aligning performance and score data are well-defined, and the tokenization scheme is innovative, allowing for efficient encoding of musical data. The model's adaptability to different time signatures and its ability to generalize across unseen time signatures are significant contributions. However, the reliance on a priori beat information may limit its applicability in scenarios where such data is not available.
The experiments are comprehensive, utilizing a suitable dataset (ASAP) that includes diverse performance MIDI files. The evaluation metrics are well-chosen, focusing on onset F1-score and note value accuracy, which are critical for assessing quantization performance. The results demonstrate strong performance compared to state-of-the-art models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more extensive comparisons with a broader range of existing methods.
The paper provides sufficient details on the model architecture, training process, and evaluation metrics, which would allow other researchers to replicate the study. However, the absence of a publicly available code repository limits reproducibility.
The main limitations include the dependency on beat annotations, which may not always be available, and the model's performance on more complex time signatures that were not part of the training set. Additionally, the focus on piano and guitar data may restrict the model's generalizability to other instruments.
This work has significant implications for music information retrieval and automatic music transcription, offering a new approach to rhythm quantization that could enhance the usability of MIDI data in various applications, including music education, performance analysis, and music generation. The model's ability to generalize across different time signatures and instruments could lead to broader applications in music technology.
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of the ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system's ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.
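One of the behaviors such a benchmark must quantify is how quickly a system responds after being interrupted. The sketch below is an illustrative metric, not the benchmark's official scorer: it measures response latency on a dual-channel recording as the gap between the end of the user's utterance and the onset of the system's next response.

```python
# Illustrative latency metric for dual-channel full-duplex evaluation
# (assumed definition, not HumDial-FDBench's official implementation).

def response_latency(user_end, system_starts):
    """Return the delay in seconds until the first system speech onset at or
    after user_end, or None if the system never responds."""
    later = [t for t in system_starts if t >= user_end]
    return min(later) - user_end if later else None

# User stops speaking at 3.20 s; system speech onsets at 1.0, 3.65, 7.2 s.
print(round(response_latency(3.20, [1.0, 3.65, 7.2]), 2))  # 0.45
```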
Primary: Nanjing University
All Institutions: Nanjing University, Northwestern Polytechnical University, AISHELL
This paper presents a comprehensive study on full-duplex interaction in spoken dialogue systems, introducing a novel dataset and evaluation framework that significantly advance the field. The methodology is well-structured, and the results demonstrate the potential for developing more human-like dialogue systems, addressing key challenges in real-time conversational dynamics.
The paper introduces a dual-channel dataset that captures realistic conversational dynamics, including interruptions and overlapping speech, which is a significant advancement over existing datasets that primarily focus on single-channel recordings. The methodology for dataset construction combines LLM-generated scripts with human recordings, ensuring both authenticity and control over interaction behavior. The evaluation framework, HumDial-FDBench, is well-structured, providing clear metrics for assessing system performance in real-time dialogue scenarios. This comprehensive approach allows for a nuanced understanding of full-duplex interaction, making it a valuable resource for future research.
The experimental results are robust, with a clear comparison of various models' performance on the released benchmark. The paper provides detailed metrics for interruption handling, rejection behavior, and response latency, which are critical for evaluating the effectiveness of dialogue systems in real-world scenarios. The inclusion of a public leaderboard enhances the transparency and reproducibility of the results, encouraging further development in this area. However, the paper could benefit from more extensive discussion on the specific experimental setups and conditions under which the models were evaluated.
The paper emphasizes the release of a publicly available dataset and benchmark, which facilitates reproducibility. The authors provide a clear methodology for data collection and evaluation metrics, allowing other researchers to replicate their experiments. However, the lack of detailed implementation specifics for the models evaluated may hinder full reproducibility for those attempting to build upon this work.
One limitation is the potential bias in the dataset construction, as it relies on scripted dialogues performed by professional actors, which may not fully capture the variability of spontaneous human interactions. Additionally, the paper acknowledges challenges related to background noise and speaker overlap, which could affect model performance in real-world applications. The evaluation metrics primarily focus on behavioral correctness and latency, potentially overlooking other important aspects of dialogue quality.
The resources provided by this research have significant implications for the development of more natural and responsive spoken dialogue systems. By addressing the limitations of traditional turn-taking paradigms, this work paves the way for advancements in human-computer interaction, with applications in customer service, virtual assistants, and conversational agents. The emphasis on real-time interaction and the ability to handle interruptions could lead to more engaging and effective communication tools.
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowledge, MAGIC-TTS is the first TTS model with explicit local timing control over token-level content duration and pause. MAGIC-TTS is enabled by explicit token-level duration conditioning, carefully prepared high-confidence duration supervision, and training mechanisms that correct zero-value bias and make the model robust to missing local controls. On our timing-control benchmark, MAGIC-TTS substantially improves token-level duration and pause following over spontaneous synthesis. Even when no timing control is provided, MAGIC-TTS maintains natural high-quality synthesis. We further evaluate practical local editing with a scenario-based benchmark covering navigation guidance, guided reading, and accessibility-oriented code reading. In this setting, MAGIC-TTS realizes a reproducible uniform-timing baseline and then moves the edited regions toward the requested local targets with low mean bias. These results show that explicit fine-grained controllability can be implemented effectively in a high-quality TTS system and can support realistic local timing-editing applications.
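The conditioning interface described here can be sketched with a small example. The sentinel value and token layout below are illustrative assumptions, not the paper's implementation; the point is that "no control provided" must be encoded distinctly from a genuine zero-second target, which is the issue the paper's zero-value correction addresses.

```python
# Hedged sketch: token-level duration conditioning with an explicit
# "no control" sentinel instead of zero, so unspecified tokens are not
# confused with true zero-duration targets.

NO_CONTROL = -1.0  # sentinel (assumed); distinguishes "unspecified" from 0 s

def build_duration_condition(tokens, controls):
    """controls maps token index -> target duration in seconds; tokens
    without an entry receive the sentinel so the model falls back to
    natural spontaneous timing for them."""
    return [controls.get(i, NO_CONTROL) for i in range(len(tokens))]

tokens = ["turn", "left", "<pause>", "in", "100", "meters"]
controls = {2: 0.6, 4: 0.45}   # insert a 0.6 s pause; stretch "100" to 0.45 s
cond = build_duration_condition(tokens, controls)
print(cond)  # [-1.0, -1.0, 0.6, -1.0, 0.45, -1.0]
```

Only the edited regions carry explicit targets, matching the paper's local-editing scenarios where the rest of the utterance keeps its natural timing.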
Primary: South China University of Technology
All Institutions: South China University of Technology
MAGIC-TTS introduces the first TTS model with explicit local timing control over token-level content duration and pause. This comprehensive analysis highlights the model's innovative approach to TTS, its rigorous methodology, and its potential to significantly impact the field of speech synthesis by improving the quality and controllability of generated speech.
The methodology presented in MAGIC-TTS is robust, leveraging a flow-based TTS backbone to achieve explicit local timing control over token-level content duration and pause. The authors introduce a novel training mechanism that incorporates high-confidence duration supervision and zero-value correction, which effectively addresses the challenges of local timing manipulation in TTS systems. The separation of timing control from the acoustic generation process is a significant improvement, allowing for precise control without compromising synthesis quality. The detailed explanation of the training data pipeline and the careful construction of timing supervision demonstrate a thorough understanding of the complexities involved in TTS systems.
The experiments are well-designed, utilizing a comprehensive timing-control benchmark to validate the effectiveness of MAGIC-TTS. The results show substantial improvements in token-level duration and pause accuracy when explicit controls are provided, with clear metrics such as mean absolute error and correlation coefficients. The ablation studies further strengthen the claims by isolating the contributions of key components, confirming the importance of zero-value correction and cross-validated timing supervision. The practical local editing scenarios also illustrate the model's versatility and real-world applicability.
The paper provides sufficient details regarding the experimental setup, including model architecture, training configurations, and evaluation protocols, which supports reproducibility. However, the absence of a publicly available demo or project URL limits the practical reproducibility of the results, as external researchers would need to replicate the entire setup from scratch.
One limitation is the reliance on high-confidence supervision, which may not be easily attainable in all datasets or languages, potentially affecting the model's generalizability. Additionally, while the paper demonstrates improvements in timing control, it does not extensively explore the impact of these improvements on user experience or subjective quality assessments in real-world applications.
The advancements in fine-grained controllability in TTS systems have significant implications for applications such as navigation guidance, accessibility tools, and interactive voice assistants. By enabling precise local timing manipulation, MAGIC-TTS can enhance the expressiveness and naturalness of synthesized speech, making it more adaptable to various contexts and user needs.
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer input and the instrument's mechanically imposed, time-varying load. Reflective optical sensors mounted beneath the distal end of each lever provide continuous displacement, timing, and articulation data without interfering with the action. Unlike existing optical systems designed for modern pianos, PHOTON accommodates the diverse geometries, limited clearances, and non-standard layouts of harpsichords, clavichords, and early fortepianos. Its modular, low-profile architecture enables high-resolution, low-latency sensing across multiple manuals and variable key counts. Beyond performance capture, PHOTON provides real-time MIDI output and supports empirical study of expressive gesture, human-instrument interaction, and the construction of instrument-specific MIDI corpora using real historical mechanisms. The complete system is released as open-source hardware and software, from schematics and PCB layouts developed in KiCad to firmware written in CircuitPython, lowering the barrier to adoption, replication, and extension.
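The sensing pipeline implied by the abstract — raw reflective-sensor readings converted to displacement, then thresholded into MIDI events — can be sketched in plain Python. This is not the released CircuitPython firmware; the calibration model, thresholds, and ADC values below are illustrative assumptions.

```python
# Illustrative sketch (not the released firmware): two-point calibration from
# raw ADC counts to key displacement in millimetres, then note-on/off
# detection with hysteresis so sensor noise near a threshold cannot chatter.

def make_calibration(adc_rest, adc_bottom, travel_mm):
    """Linear map from ADC counts (key at rest / fully depressed) to mm."""
    span = adc_bottom - adc_rest
    return lambda adc: (adc - adc_rest) / span * travel_mm

class KeyTracker:
    ON_MM, OFF_MM = 4.0, 2.0   # hysteresis thresholds (assumed values)

    def __init__(self, to_mm):
        self.to_mm, self.down = to_mm, False

    def update(self, adc):
        """Return 'on'/'off' when the key crosses a threshold, else None."""
        mm = self.to_mm(adc)
        if not self.down and mm >= self.ON_MM:
            self.down = True
            return "on"
        if self.down and mm <= self.OFF_MM:
            self.down = False
            return "off"
        return None

key = KeyTracker(make_calibration(adc_rest=1200, adc_bottom=5200, travel_mm=8.0))
events = [key.update(v) for v in (1200, 2600, 4000, 2600, 1600)]
print(events)  # [None, None, 'on', None, 'off']
```

The continuous displacement stream, rather than just the on/off events, is what enables the study of touch and articulation the paper describes.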
Primary: Institute for Logic, Language, and Computation
All Institutions: Institute for Logic, Language, and Computation, University of Amsterdam
The main contribution of this paper is the introduction of the PHOTON system, a non-invasive optical tracking technology for historical keyboard instruments that facilitates detailed analysis of key-lever motion and expressive gesture. This innovative approach, combined with its open-source nature, positions PHOTON as a valuable tool for researchers and performers alike, potentially transforming the study and practice of historical keyboard music.
The methodology presented in this paper is innovative and well-structured, focusing on a non-invasive optical sensing system tailored for historical keyboard instruments. The use of reflective optical sensors to measure key-lever motion is a significant advancement over existing systems, which are primarily designed for modern pianos. The modular and low-profile design allows for high-resolution data capture while accommodating the unique geometries of historical instruments. The authors provide a thorough explanation of the hardware design, including sensor selection, calibration, and integration, which demonstrates a strong understanding of the mechanical constraints involved. The open-source nature of the project enhances its accessibility and encourages further research and development.
While the paper does not present extensive experimental results, it includes a case study that illustrates the effectiveness of the PHOTON system in capturing key-action behavior on a harpsichord. The authors provide motion traces that reveal fine-grained aspects of touch and articulation, which are crucial for understanding performance nuances. However, more comprehensive experiments comparing PHOTON with existing systems or evaluating its performance across various historical instruments would strengthen the paper's contributions.
The authors emphasize reproducibility by providing detailed schematics, PCB layouts, and firmware source code. The use of widely available components and open-source tools further supports the project's replicability. The inclusion of a custom KiCad plugin for sensor placement is particularly noteworthy, as it simplifies the adaptation of the system to different keyboard layouts.
One limitation of the study is the lack of extensive empirical validation across a broader range of historical keyboard instruments. While the case study is informative, additional data from various setups would provide a more robust evaluation of the system's capabilities. Furthermore, ethical considerations regarding unobtrusive sensing are briefly mentioned but could benefit from a more in-depth discussion.
The PHOTON system has the potential to significantly impact the fields of musicology, performance practice, and instrument design. By enabling detailed empirical studies of expressive gesture and human-instrument interaction, it opens new avenues for research that have been historically underrepresented. The integration of real-time MIDI output and the ability to create instrument-specific MIDI corpora can enhance both educational and performance contexts, making historical keyboard instruments more accessible to contemporary musicians.
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic gradient of the portamento slide, measured in Hz/second, and demonstrates its measurement using a protocol combining Sonic Visualiser's melodic spectrogram layer, GIMP pixel analysis, and metric calibration against the spectrogram's known frequency axis. The gradient captures what duration alone cannot: the steepness of the pitch trajectory, which encodes the expressive character of the slide independently of its length. The method is applied to the opening measures of two sonatas, chosen specifically because their monophonic texture permits reliable spectrographic pitch tracking. It yields gradient values ranging from approximately 600 Hz/s in late-period recordings to over 4,000 Hz/s in early twentieth-century performances. The paper further documents a gain-recovery protocol that extends the analysable corpus to analogue recordings from the 1930s where portamento traces are faint in digital transfer. Applying the method to a corpus of 22 recordings spanning 1930-2012, the paper tests the hypothesis that gradient steepness correlates negatively with tempo: slower performances produce steeper, longer slides while faster performances produce shallower slides or none at all. The results support this hypothesis, suggesting that the widely documented decline of portamento across the twentieth century is not a binary transition from presence to absence but a continuous decline in expressiveness.
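The gradient calculation implied by this protocol reduces to measuring the slide's pixel extent and calibrating against the spectrogram's known axes. The sketch below illustrates that arithmetic; all constants are invented for illustration and are not the paper's calibration values.

```python
# Sketch of the Hz/s gradient computation: measure the slide's horizontal and
# vertical pixel extent (e.g. in GIMP), calibrate pixels against the
# spectrogram's frequency and time axes, and report the slope in Hz/second.

def portamento_gradient(dx_px, dy_px, hz_per_px, s_per_px):
    """Slope of the pitch trajectory in Hz/second."""
    delta_hz = dy_px * hz_per_px
    delta_s = dx_px * s_per_px
    return delta_hz / delta_s

# e.g. a slide spanning 120 px horizontally and 90 px vertically, with an
# assumed calibration of 4 Hz per pixel and 0.0025 s per pixel:
g = portamento_gradient(dx_px=120, dy_px=90, hz_per_px=4.0, s_per_px=0.0025)
print(round(g, 1))  # 1200.0 Hz/s
```

Because the calibration constants depend on the spectrogram settings, they must be re-derived for each analysis configuration — a point the paper itself flags as limiting cross-study comparison.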
Primary: unknown
All Institutions: unknown
This paper introduces a new quantitative descriptor for portamento in string performance, significantly enhancing the analysis of expressive techniques in historical recordings. The innovative methodology and empirical findings provide valuable insights into the evolution of musical expression, making a meaningful contribution to the fields of musicology and audio analysis.
The paper introduces a novel methodology for measuring portamento in string performance through a spectrographic gradient, which is a significant advancement over existing binary measures of portamento presence and duration. The combination of Sonic Visualiser for spectrogram analysis and GIMP for pixel analysis is innovative, allowing for a more nuanced understanding of musical expressiveness. The calibration of the gradient measurement to physical units (Hz/second) adds rigor and comparability to the findings.
The experiments are well-structured, utilizing a corpus of 22 recordings spanning over eight decades. The analysis of gradient values and their correlation with tempo provides empirical support for the paper's hypotheses. The use of historical recordings adds depth to the findings, showing a continuous decline in portamento expressiveness rather than a simple absence.
The methodology is detailed, with clear steps for measurement and calibration, which should allow for reproducibility by other researchers. However, the reliance on human judgment in placing reference points for gradient measurement introduces variability that could affect reproducibility.
The study is limited to specific passages of two sonatas, which may not generalize across the entire cello repertoire. Additionally, the subjective nature of reference point placement could lead to inconsistencies in gradient measurement. The calibration constants are also specific to the settings used, which may limit comparisons with other studies.
This research has the potential to influence both musicology and performance practice by providing a quantitative framework for analyzing expressive techniques in string performance. The findings could inform teaching practices and performance interpretations, as well as contribute to the broader understanding of stylistic evolution in music.
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in conference rooms, lecture halls, and interviews. We present EchoMask, the first practical physical-layer system for real-time voiceprint anonymization using acoustic metamaterials. By modifying sound waves before they reach the microphone, EchoMask prevents attackers from capturing clean voiceprints through compromised devices. Our design combines three key innovations: frequency-selective interference to disrupt voiceprint features while preserving speech intelligibility, an acoustic-field model to ensure stability under speaker movement, and reconfigurable structures that create time-varying interference to prevent learning or canceling a fixed acoustic pattern. EchoMask is low-cost, power-free, and 3D-printable, requiring no machine learning, software support, or microphone modification. Experiments conducted across eight microphones in diverse environments demonstrate that EchoMask increases the Miss-match Rate, i.e., the fraction of failed voiceprint matching attempts, to over 90%, while maintaining high speech intelligibility.
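The headline metric can be stated precisely with a small sketch. The scoring setup (similarity scores compared against a verification threshold) is an assumed, standard speaker-verification arrangement, not a detail confirmed by the abstract.

```python
# Minimal sketch of the Miss-match Rate (MMR): the fraction of voiceprint
# matching attempts that fail. Threshold-based similarity scoring is an
# assumption for illustration.

def miss_match_rate(scores, threshold):
    """scores: similarity between captured (masked) audio and the enrolled
    voiceprint; an attempt fails when the score falls below threshold."""
    misses = sum(s < threshold for s in scores)
    return misses / len(scores)

scores = [0.12, 0.08, 0.31, 0.55, 0.04]   # masked recordings vs. enrollment
print(miss_match_rate(scores, threshold=0.40))  # 0.8
```

An MMR above 90%, as reported, means the masked signal almost never verifies against the speaker's clean enrollment, while intelligibility for human listeners is preserved.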
Primary: Northwest University
All Institutions: Northwest University, University of Leeds
This paper presents a pioneering approach to voiceprint anonymization using acoustic metamaterials, addressing critical challenges in real-time applications while maintaining speech intelligibility. The combination of innovative design principles and thorough experimental validation positions this work as a significant contribution to the field of audio privacy and security.
The methodology presented in this paper is innovative, leveraging acoustic metamaterials for voiceprint anonymization in real-time scenarios. The authors effectively address three critical challenges: maintaining speech intelligibility while disrupting identity cues, ensuring stability under speaker movement, and preventing predictable acoustic patterns. The design principles are well-structured, focusing on targeted low-frequency perturbation, dynamic stability, and passive randomization, which collectively enhance the robustness of the system. The use of numerical simulations and physical experimentation to validate the design is commendable, although the lack of machine learning integration may limit adaptability in some contexts.
The experiments are comprehensive, evaluating the system across various microphones and real-world conditions. The results demonstrate a high Miss-match Rate (MMR) of over 90%, indicating effective voiceprint protection while maintaining speech intelligibility. The inclusion of subjective listening tests (Mean Opinion Score) further strengthens the evaluation by providing insights into perceived audio quality. However, the paper could benefit from a more detailed breakdown of the experimental setup and conditions to enhance transparency.
While the paper provides a solid theoretical foundation and experimental results, it lacks specific implementation details that would facilitate reproducibility. Key parameters, such as the exact configurations of the metamaterials and the experimental setups, are not thoroughly documented. Additionally, the absence of a project URL or code repository limits the ability of other researchers to replicate the work.
The primary limitations include the reliance on passive metamaterials, which may restrict adaptability to varying acoustic environments and speaker dynamics. The system's performance under extreme conditions (e.g., very high noise levels or rapid speaker movement) is not fully explored. Furthermore, while the approach is innovative, it does not incorporate machine learning techniques that could enhance performance through adaptive learning.
The implications of this research are significant, particularly in enhancing privacy and security in voice-based authentication systems. The ability to anonymize voiceprints in real-time without requiring modifications to existing devices opens up new avenues for protecting users in public and shared environments. The findings could influence future designs of microphones and voice interaction systems, promoting user privacy in increasingly digital and interconnected spaces.
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
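The token compression idea can be sketched as a selector that keeps only the most relevant audio tokens. This is a simplified stand-in: the paper's selector module, scoring mechanism, and compression ratio are not specified here, and real embeddings would be tensors rather than strings.

```python
# Hedged sketch of a token-selector-style compression step: retain the top-k
# audio tokens by a (learned) relevance score while preserving temporal order,
# so the MLLM sees far fewer audio tokens per query.

def compress_audio_tokens(tokens, scores, k):
    """tokens: per-frame audio token embeddings; scores: relevance per token.
    Returns the k highest-scoring tokens in their original order."""
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

tokens = ["a0", "a1", "a2", "a3", "a4", "a5"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
print(compress_audio_tokens(tokens, scores, k=3))  # ['a1', 'a3', 'a5']
```

Because selection happens before the language model consumes the sequence, this kind of compression is orthogonal to methods that merge or downsample tokens inside the audio encoder, which is the orthogonality the paper claims.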
Primary: Renmin University of China
All Institutions: Renmin University of China
The paper presents a novel approach to audio-text interleaved contextual retrieval, introducing the ATIR task and a benchmark that significantly enhances the capabilities of existing retrieval systems. The comprehensive methodology, innovative technical contributions, and thorough experimental validation position this work as a meaningful advancement in the field of multimodal information retrieval.
The methodology presented in the paper is robust, introducing the ATIR task and a comprehensive benchmark that addresses the limitations of existing audio retrieval datasets. The novel token compression mechanism and the bi-encoder architecture with a token selector module are innovative contributions that enhance the performance of interleaved audio-text retrieval. The synthesis pipeline for data generation is well-structured, ensuring high-quality multimodal data that is critical for training effective models.
The experimental evaluation is thorough, demonstrating significant improvements over strong baselines across various retrieval settings. The use of multiple metrics (Recall@k and nDCG@k) provides a comprehensive assessment of model performance. The ablation studies effectively validate the contributions of the proposed components, particularly the token selector's impact on retrieval efficiency and accuracy.
The paper provides detailed implementation information, including model architecture, training configurations, and hyperparameters, which supports reproducibility. However, the lack of a publicly available project or demo URL limits accessibility for other researchers wishing to replicate the results.
The paper acknowledges limitations, such as the focus on single-document retrieval and the potential for future exploration of more complex retrieval scenarios. Additionally, the lightweight representation design may restrict performance in certain contexts, and the evaluation is primarily centered on QA-centric tasks, leaving broader applications untested.
The introduction of the ATIR task and benchmark has the potential to significantly influence multimodal retrieval research, particularly in applications involving conversational agents and hybrid search systems. The findings could lead to advancements in how audio and text are integrated for more effective information retrieval systems.
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.
Primary: FindLab
All Institutions: FindLab
The paper introduces a novel two-stage OMR approach that effectively decodes complex polyphonic music into structured formats, significantly advancing the field of music recognition. The methodology leverages innovative techniques to address longstanding challenges in music transcription, with implications for both practical applications and future research directions.
The paper presents a two-stage Optical Music Recognition (OMR) pipeline that innovatively formulates the second stage as a structure decoding problem. The use of topology recognition with a probability-guided search (BeadSolver) is a significant methodological advancement, addressing the complex challenges of voice separation and timing in polyphonic music. The integration of procedural generation with recognition-feedback annotations for training data further enhances the robustness of the proposed method.
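As a rough illustration of probability-guided search in general (not the paper's BeadSolver, whose structure decoding handles voices and timing jointly), a best-first decoder over per-event candidate labels might look like the following; the candidate lists and probabilities are invented for the example:

```python
import heapq
import math

def best_first_decode(candidates):
    """Probability-guided search over per-event candidate labels.

    candidates: list of [(label, prob), ...] lists, one per event.
    Expands partial paths in order of cumulative negative log-probability,
    so the first complete path popped is the joint maximum-probability one.
    """
    heap = [(0.0, 0, ())]  # (neg log-prob, next event index, labels so far)
    while heap:
        neg_lp, i, path = heapq.heappop(heap)
        if i == len(candidates):
            return list(path), math.exp(-neg_lp)
        for label, p in candidates[i]:
            heapq.heappush(heap, (neg_lp - math.log(p), i + 1, path + (label,)))

# Hypothetical voice-assignment candidates for two note events.
cands = [[("voice1", 0.7), ("voice2", 0.3)],
         [("voice1", 0.4), ("voice2", 0.6)]]
labels, prob = best_first_decode(cands)
print(labels, round(prob, 2))  # ['voice1', 'voice2'] 0.42
```

Because negative log-probabilities are nonnegative, this is uniform-cost search and the first completed path is provably optimal; a real structure decoder would additionally prune paths that violate notation constraints.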
The experiments are well-structured, comparing the proposed BeadSolver against rule-based and linear-equations baselines. The results demonstrate clear improvements in the quality of the structured output, indicating that the proposed method effectively addresses the limitations of existing approaches. However, specific quantitative results and metrics used for evaluation could be more explicitly detailed to strengthen the findings.
The paper outlines the methodology and provides a clear description of the data pipeline and model architecture, which aids in reproducibility. However, the absence of publicly available code or datasets limits the ability to fully replicate the results.
The paper does not address potential limitations in handling highly variable music notations or the scalability of the proposed method to broader music genres beyond piano scores. Additionally, the reliance on procedural generation for training data may introduce biases that are not fully explored.
The proposed OMR system has the potential to significantly enhance the accessibility of historical and contemporary music scores, enabling better integration into digital music platforms and educational tools. This could foster greater engagement with music education and preservation efforts.
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate poorly with perceptual audio quality ratings from listening tests, which are considered the gold-standard evaluation method. As an alternative for singing voice separation, embedding-based intrusive metrics have been introduced that leverage latent representations from large self-supervised audio models such as Music undERstanding with large-scale self-supervised Training (MERT). In this work, we analyze the correlation of perceptual audio quality ratings with two intrusive embedding-based metrics: a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD), both calculated on MERT embeddings. Experiments on two independent datasets show that these metrics correlate more strongly with perceptual audio quality ratings than traditional BSS-Eval metrics across all analyzed stem and model types.
Primary: University of Music and Performing Arts Graz
All Institutions: University of Music and Performing Arts Graz
The main contribution of this paper is the introduction of embedding-based intrusive evaluation metrics for musical source separation, which demonstrate stronger correlations with perceptual audio quality ratings than traditional BSS-Eval metrics. This work significantly advances the evaluation methodologies in the field, providing a more perceptually relevant framework for assessing audio separation models.
The paper introduces a novel approach to evaluate musical source separation (MSS) using embedding-based intrusive metrics derived from MERT representations. The methodology is well-structured, leveraging self-supervised audio models to compute metrics that correlate better with human perceptual ratings compared to traditional BSS-Eval metrics. The use of two specific metrics (MSE and an intrusive variant of FAD) is innovative, and the paper provides a clear explanation of how these metrics are calculated and their significance in the context of MSS evaluation.
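As a sketch of how such intrusive embedding metrics are typically computed (the embedding shapes and the Gaussian-fit Fréchet variant below are generic assumptions, not the paper's exact implementation):

```python
import numpy as np
from scipy.linalg import sqrtm

def embedding_mse(ref, est):
    """Frame-wise MSE between reference and estimate embeddings, each (T, D)."""
    return float(np.mean((ref - est) ** 2))

def frechet_distance(ref, est):
    """Fréchet distance between Gaussians fit to two embedding sets (T, D)."""
    mu_r, mu_e = ref.mean(axis=0), est.mean(axis=0)
    cov_r = np.cov(ref, rowvar=False)
    cov_e = np.cov(est, rowvar=False)
    covmean = sqrtm(cov_r @ cov_e)
    if np.iscomplexobj(covmean):      # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(cov_r + cov_e - 2.0 * covmean))

rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 8))           # stand-in for MERT embeddings
est = ref + 0.1 * rng.normal(size=(200, 8))
print(embedding_mse(ref, est))            # small positive value
print(frechet_distance(ref, ref))         # ~0 for identical sets
```

The key property of the intrusive setup is that both metrics compare the estimate against the matched reference stem, rather than against a reference distribution as in the usual non-intrusive FAD.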
The experiments are robust, utilizing two independent datasets (Bake-Off and GenSVS) to validate the proposed metrics. The correlation analysis conducted using Spearman's rank correlation coefficient (SRCC) and Pearson's correlation coefficient (PCC) is appropriate and effectively demonstrates the superiority of the embedding-based metrics over traditional methods. The results are well-presented, with clear tables and figures that summarize the findings.
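The correlation analysis itself is straightforward to reproduce on toy data; a minimal sketch with hypothetical per-system scores (all numbers invented for illustration) looks like:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system values: an error-style metric (lower = better)
# and mean listener quality ratings for the same five systems.
metric_scores = [0.12, 0.30, 0.45, 0.51, 0.80]
mos_ratings = [4.6, 4.1, 3.4, 3.0, 1.9]

srcc, _ = spearmanr(metric_scores, mos_ratings)
pcc, _ = pearsonr(metric_scores, mos_ratings)
print(f"SRCC={srcc:.2f}, PCC={pcc:.2f}")  # strongly negative: metric tracks quality
```

SRCC captures monotonic agreement in rankings while PCC measures linear agreement, which is why reporting both, as the paper does, gives a fuller picture of metric validity.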
The paper provides sufficient detail about the datasets and the implementation of the metrics, including references to the Python packages used. However, the absence of direct access to the datasets limits full reproducibility for external researchers. The code repository linked in the paper enhances reproducibility for the proposed metrics and analyses.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of musical sources encountered in real-world applications. Additionally, while the proposed metrics show improved correlation with perceptual ratings, the paper does not explore their performance across a broader range of audio genres or separation tasks.
The findings have significant implications for the field of audio processing and music technology, as they suggest a more reliable evaluation framework for MSS models. This could lead to improved development and assessment of audio separation technologies, benefiting applications in music production, audio restoration, and content creation. The approach could also inspire further research into embedding-based evaluation metrics in other audio-related tasks.
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.
Primary: Unaffiliated
All Institutions: Unaffiliated, Pine AI
The paper presents Sema, a semantic transport system that significantly reduces bandwidth requirements for real-time multimodal agents while maintaining task accuracy. The innovative approach and strong experimental results position this work as a meaningful contribution to the field of machine learning, particularly in audio and multimodal communication contexts.
The methodology presented in the paper introduces a novel semantic transport system, Sema, which shifts the focus from traditional signal fidelity to semantic meaning preservation. The authors effectively combine discrete audio tokenization with a hybrid screen representation, optimizing for real-time multimodal agent communication. The approach is well-structured, leveraging existing technologies in a new context, and the design principles are clearly articulated. However, the paper could benefit from a more detailed exploration of the implementation specifics and potential integration challenges with existing systems.
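The scale of such savings can be illustrated with back-of-the-envelope arithmetic; all parameters below are assumptions chosen for illustration, not the paper's measured configuration (which reports 64x for audio against its own baseline and tokenizer):

```python
# Uplink budget for raw audio vs. discrete audio tokens (ASSUMED parameters).
pcm_bps = 16_000 * 16                       # 16 kHz, 16-bit mono PCM: 256 kbit/s
token_rate_hz = 50                          # assumed discrete tokens per second
bits_per_token = 11                         # assumed ~2048-entry codebook
token_bps = token_rate_hz * bits_per_token  # 550 bit/s
ratio = pcm_bps / token_bps
print(f"compression ratio: {ratio:.0f}x")
```

The exact ratio depends heavily on the baseline codec and tokenizer settings, but the arithmetic makes clear why token-level transport can undercut even compressed-audio streaming by orders of magnitude.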
The experimental evaluation is robust, utilizing simulations under emulated WAN conditions to demonstrate significant reductions in uplink bandwidth for both audio and screenshots while maintaining task accuracy. The results are compelling, showcasing the effectiveness of the proposed system in practical scenarios. However, the reliance on simulation rather than real-world testing limits the generalizability of the findings.
The paper lacks sufficient implementation details that would facilitate reproducibility. While the authors describe their methods and evaluations, the absence of a publicly available codebase or detailed algorithmic descriptions hinders other researchers from replicating the study.
The primary limitations include the lack of real-world testing, which raises questions about the performance of the system in diverse network conditions. Additionally, the paper does not address potential challenges in integrating the proposed system with existing multimodal agent architectures, which could affect its adoption.
The implications of this work are significant, as it addresses a critical bottleneck in multimodal agent communication by optimizing data transport for AI models rather than human users. This could lead to more efficient and responsive AI systems, enhancing applications in various domains such as virtual assistants, gaming, and remote collaboration tools.