While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language: unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacks the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across ~750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy (Cal), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher Cal, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
Primary: Seoul National University
All Institutions: Seoul National University, University of Seoul
The paper presents the Omni-Persona benchmark, a pioneering framework for evaluating omnimodal personalization that significantly enhances the understanding of model performance across audio, text, and visual modalities. The comprehensive methodology and experimental rigor contribute valuable insights into the challenges and potential solutions in the field of multimodal AI systems.
The paper introduces the Omni-Persona benchmark, a novel framework for evaluating omnimodal personalization that incorporates audio, text, and visual modalities. The methodology is rigorous, employing the Persona Modality Graph (PMG) to formalize user profiles and cross-modal routing tasks. The introduction of Calibrated Accuracy (Cal) as a metric is particularly noteworthy, as it addresses the limitations of traditional recall metrics by incorporating both correct grounding and appropriate abstention. The systematic treatment of absent-persona scenarios is a significant advancement in the field, allowing for a more realistic evaluation of model performance in real-world conditions.
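The abstract does not reproduce the exact formula for Cal; as a rough illustration of a metric that jointly rewards correct grounding on answerable items and abstention on absent-persona items, a minimal Python sketch (with a simple averaged per-item score as an assumed form, not the paper's definition) might look as follows.

    # Minimal sketch of a calibrated-accuracy-style metric, assuming a simple
    # averaged per-item score; the paper's exact Cal definition may differ.
    # Each item is (is_answerable, model_abstained, model_correct).

    def calibrated_accuracy(items):
        """Reward correct grounding on answerable items and abstention on
        absent-persona items; hallucinated answers score zero."""
        scores = []
        for is_answerable, abstained, correct in items:
            if is_answerable:
                scores.append(1.0 if (correct and not abstained) else 0.0)
            else:
                scores.append(1.0 if abstained else 0.0)  # absent persona: abstain
        return sum(scores) / len(scores) if scores else 0.0

    # Example: two answerable items (one right, one wrong) and one absent-persona
    # query that the model correctly declines.
    print(calibrated_accuracy([(True, False, True), (True, False, False), (False, True, False)]))
    # -> 0.666...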
The experiments are well-structured, comparing various models under different training regimes (SFT and RLVR). The findings reveal critical insights into the performance of open-source models, particularly highlighting the audio-visual grounding gap and the limitations of scaling SFT datasets. The use of diverse models and the detailed analysis of their performance across multiple tasks provide a comprehensive understanding of the challenges in omnimodal personalization. The results are robust, demonstrating the effectiveness of the proposed benchmark in revealing model weaknesses that traditional metrics may overlook.
The paper provides sufficient details regarding the experimental setup, including model architectures, training regimes, and evaluation metrics. However, the reliance on synthetic data and the use of LLM-as-a-judge for evaluation may introduce biases that could affect reproducibility. Future work should aim to validate these findings with human-annotated datasets to enhance generalizability.
The benchmark relies on synthetic audio and text data, which may not fully capture the complexities of real-world scenarios. Additionally, the use of LLM-as-a-judge could introduce biases in evaluation, and the paper acknowledges the need for further human verification. The trade-off observed in RLVR, where models may become overly conservative in their abstention behavior, also highlights a potential area for improvement in reward design.
The Omni-Persona benchmark has significant implications for the development of personalized AI systems, particularly in applications requiring nuanced understanding across multiple modalities. By addressing the challenges of absent-persona scenarios and grounding accuracy, this work could lead to more reliable and effective personal assistants that better serve user needs in real-world contexts.
While speech Large Language Models (LLMs) excel at conventional tasks like basic speech recognition, they lack fine-grained, multi-dimensional perception. This deficiency is evident in their struggle to disentangle complex features like micro-acoustic cues, acoustic scenes, and paralinguistic signals. The resulting incomplete comprehension of real-world speech fundamentally bottlenecks the development of perceptive and empathetic next-generation speech systems. At its core, this persistent perceptual limitation primarily stems from three interacting factors: scarce high-quality expressive data, absent fine-grained modeling for multi-dimensional attributes, and reliance on restricted-coverage, coarse-grained benchmarks. We address these challenges through three pillars: First, our robust data curation pipeline resolves complex acoustic environments and long-audio timestamp alignment challenges to extract a high-quality spontaneous speech corpus from audiovisual sources. Second, we construct FMSU-Bench, a pioneering benchmark covering 14 speech attribute dimensions to rigorously assess the fine-grained, multi-dimensional speech understanding capabilities of current models. Third, empowered by our curated corpus, we introduce FM-Speech. Driven by a decoupled attribute modeling and progressive curriculum fine-tuning framework, it substantially elevates fine-grained, multi-dimensional acoustic perception. Extensive evaluations on FMSU-Bench reveal that current speech LLMs still require significant improvement in multi-dimensional, fine-grained understanding. In contrast, FM-Speech substantially outperforms current open-source models, establishing a robust paradigm for real-world speech understanding.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The paper makes a significant contribution to the field of machine learning by addressing the limitations of existing speech models and proposing a comprehensive framework for fine-grained multi-dimensional speech understanding. The innovative methodologies and rigorous evaluations presented establish a strong foundation for future research in this area.
The paper introduces a comprehensive methodology for fine-grained multi-dimensional speech understanding through a robust data curation pipeline, a pioneering benchmark (FMSU-Bench), and a novel model (FM-Speech). The data pipeline effectively addresses the challenges of extracting high-quality spontaneous speech from audiovisual sources, while the benchmark provides a structured framework for evaluating speech models across 14 distinct dimensions. The progressive curriculum fine-tuning framework for FM-Speech is particularly innovative, allowing for the decoupling of complex auditory attributes and improving model performance on fine-grained tasks.
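The progressive curriculum fine-tuning framework is described only at a high level here; the following hypothetical Python sketch illustrates the general idea of staging fine-tuning from coarse to fine attribute dimensions. The stage names, attribute groupings, and the train_stage helper are invented for illustration and are not the paper's actual configuration.

    # Hypothetical sketch of progressive curriculum fine-tuning; all names and
    # groupings below are illustrative assumptions, not FM-Speech's setup.

    CURRICULUM = [
        ("stage1_coarse", ["speaker_gender", "language", "acoustic_scene"]),
        ("stage2_paralinguistic", ["emotion", "speaking_rate", "vocal_effort"]),
        ("stage3_fine_grained", ["micro_acoustic_cues", "breathiness", "overlap_events"]),
    ]

    def run_curriculum(model, corpus, train_stage):
        # train_stage(model, examples) is assumed to perform one fine-tuning pass
        # on examples restricted to the given attribute dimensions.
        for stage_name, attributes in CURRICULUM:
            examples = [ex for ex in corpus if ex["attribute"] in attributes]
            print(f"{stage_name}: {len(examples)} examples over {attributes}")
            train_stage(model, examples)
        return model

    # Trivial usage with placeholder objects, just to show the staging order.
    run_curriculum(model=None, corpus=[], train_stage=lambda m, ex: None)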
The experiments conducted on FMSU-Bench demonstrate the effectiveness of the proposed methods. The evaluation includes a systematic comparison of FM-Speech against 11 advanced speech LLMs, showcasing significant improvements in multi-dimensional understanding. The rigorous data filtering and human verification processes enhance the reliability of the benchmark, while the use of innovative evaluation metrics (such as PATA) adds depth to the assessment of model performance.
The paper provides detailed descriptions of the methodology, data curation pipeline, and experimental setup, which contribute to reproducibility. However, the lack of access to the proprietary models used for comparison may limit the ability of others to fully replicate the results. The project URL provides access to the code and resources, which is beneficial for reproducibility.
One limitation is the reliance on specific audiovisual sources (movies and TV shows), which may not fully represent the diversity of real-world speech. Additionally, while the benchmark covers a wide range of speech attributes, the complexity of human speech may still present challenges in capturing all nuances. The paper also does not address potential biases in the data or the models used.
The advancements presented in this paper have significant implications for the development of next-generation speech systems that require fine-grained understanding and empathy in human-computer interactions. The establishment of FMSU-Bench sets a new standard for evaluating speech models, potentially influencing future research and applications in audio processing, speech recognition, and human-computer interaction.
Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test MEG from a large-scale audio library (LibriVox). Second, a speech detection model generates the binary silence/speech sequence directly from this retrieved audio. With this approach, our team Sherlock Holmes achieved first place in the extended track (F1-score: 0.962), demonstrating that leveraging external audio databases is a highly effective strategy.
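The retrieval step can be illustrated with a minimal sketch that assumes precomputed embeddings from the contrastive model; the embedding dimensions and function names below are placeholders, not the authors' interface.

    # Minimal retrieval sketch, assuming precomputed embeddings; the contrastive
    # encoders themselves are not shown and the dimensions are illustrative.
    import numpy as np

    def retrieve_best_audio(meg_embedding, library_embeddings):
        """Return the index of the library segment whose embedding is most
        similar (cosine similarity) to the MEG segment embedding."""
        meg = meg_embedding / (np.linalg.norm(meg_embedding) + 1e-8)
        lib = library_embeddings / (np.linalg.norm(library_embeddings, axis=1, keepdims=True) + 1e-8)
        sims = lib @ meg
        return int(np.argmax(sims)), float(np.max(sims))

    # The retrieved audio would then be passed to a separate speech-detection
    # model that outputs the binary silence/speech sequence.
    rng = np.random.default_rng(0)
    idx, score = retrieve_best_audio(rng.normal(size=256), rng.normal(size=(10000, 256)))
    print(idx, score)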
Primary: Peking University
All Institutions: College of Future Technology, Academy for Advanced Interdisciplinary Studies, Center for BioMed-X Research, Institute of Molecular Medicine, National Biomedical Imaging Center, Peking-Tsinghua Center for Life Sciences, School of Intelligence Science and Technology, Speech and Hearing Research Center, State Key Laboratory of General Artificial Intelligence, State Key Laboratory of Membrane Biology
The paper presents a novel two-step framework for speech detection from MEG signals, achieving state-of-the-art results by leveraging large-scale audio retrieval. This work demonstrates a significant advancement in the field of non-invasive BCIs and opens new avenues for research in audio processing and brain signal interpretation.
The proposed two-step framework is innovative in its approach to bypass direct reconstruction of speech from MEG signals by leveraging a large-scale audio library for retrieval. The use of contrastive learning for matching MEG segments with audio segments is a novel application in this context, highlighting the potential of match-mismatch tasks over traditional regression methods. The methodology is well-structured, with clear steps outlined for both the retrieval and detection phases, although the paper could benefit from more detailed explanations of the model architectures and hyperparameter choices.
The experiments are robust, with a clear description of data preparation, model training, and testing procedures. The authors achieved an impressive F1-score of 0.962, which is a significant contribution to the field, particularly given the challenges associated with decoding speech from noisy brain signals. However, the paper lacks a comparative analysis with other existing methods, which would strengthen the claims of superiority.
While the paper provides a good overview of the methods and results, it lacks detailed implementation specifics such as code availability, which is crucial for reproducibility. The absence of a public repository or demo limits the ability of other researchers to replicate the results.
One limitation is the reliance on a specific audio library (LibriVox), which may not generalize well to other datasets or real-world applications. Additionally, the method's performance on diverse speech types or accents is not addressed, which could affect its applicability. The paper also does not discuss the computational resources required for the proposed approach, which may limit accessibility for some researchers.
This research has the potential to significantly advance non-invasive brain-computer interfaces (BCIs) and improve communication methods for individuals with speech impairments. The innovative use of audio retrieval could inspire further exploration in related fields, such as cognitive neuroscience and assistive technologies.
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.
Primary: Princeton University
All Institutions: Princeton University, Technische Universität München, Rotterdam Eye Hospital, Centre for Tactile Internet with Human-in-the-Loop, Technische Universität Dresden, Munich Center for Machine Learning, Chair for Social Affective Touch
This paper presents a novel real-time sonification framework for enhancing surgical guidance during subretinal injections, demonstrating significant improvements in event identification accuracy through innovative auditory feedback mechanisms. The methodology and experimental results indicate a strong potential for clinical impact, although further validation in diverse surgical contexts is necessary for widespread adoption.
The proposed methodology introduces a structured sonification framework that effectively maps iOCT-derived anatomical features into auditory feedback, leveraging a physics-inspired acoustic model. The approach is well-defined, utilizing real-time updates based on segmented retinal layers and employing a mass-spring-damper system to reflect dynamic interactions during subretinal injections. The integration of both tool-driven and anatomy-driven excitations is innovative, enhancing the auditory feedback's relevance to surgical contexts. However, the reliance on a specific anatomical model may limit generalizability across different surgical scenarios.
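As a rough illustration of the mass-spring-damper dynamics underlying such a physics-inspired sound model, the following Python sketch integrates a single damped oscillator driven by an excitation signal; the parameter values and the mapping from the oscillator state to audio are hypothetical, not the paper's tuned model.

    # Illustrative discrete-time mass-spring-damper, assuming displacement of a
    # segmented retinal layer (or needle contact) acts as the excitation input.
    import numpy as np

    def mass_spring_damper(excitation, m=1.0, k=400.0, c=2.0, dt=1e-3):
        """Integrate m*x'' + c*x' + k*x = f(t) with semi-implicit Euler; the
        oscillating state x(t) could drive amplitude or pitch of a sound model."""
        x, v = 0.0, 0.0
        out = np.empty(len(excitation))
        for i, f in enumerate(excitation):
            a = (f - c * v - k * x) / m
            v += a * dt
            x += v * dt
            out[i] = x
        return out

    # Example: an impulse-like 'needle contact' excitation produces a damped ring.
    exc = np.zeros(2000)
    exc[100] = 50.0
    print(mass_spring_damper(exc)[:5])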
The user study involving 34 participants provides robust evidence of the proposed method's effectiveness, demonstrating significant improvements in event identification accuracy compared to a baseline. The statistical significance of the results (p < 0.001) strengthens the claims of enhanced performance. The qualitative evaluations and feedback from expert surgeons further validate the clinical applicability of the framework. However, additional details on participant demographics and the specific experimental setup would enhance the evaluation's transparency.
The paper provides a GitHub repository link for the code, which is a positive step towards reproducibility. However, the implementation details could be more thoroughly documented to facilitate easier replication by other researchers. The reliance on specific software libraries (e.g., miPhysics) should also be clearly stated to avoid potential compatibility issues.
The study's limitations include a small sample size for expert feedback and the potential for bias in participant selection. The framework's performance in diverse surgical scenarios beyond subretinal injection remains untested. Additionally, the auditory feedback's effectiveness may vary based on individual surgeon preferences and experiences, which could affect its adoption in clinical practice.
The proposed sonification framework has the potential to significantly enhance surgical precision and reduce cognitive load during delicate procedures like subretinal injections. By providing real-time auditory feedback, it could improve patient outcomes and streamline surgical workflows. The approach may also inspire further research into auditory feedback systems in other medical domains, potentially leading to broader applications in minimally invasive surgeries.
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
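A minimal PyTorch sketch of the two-granularity idea behind the hierarchical speaker tokenizer is given below; the layer sizes, pooling choice, and module names are assumptions made for illustration rather than SpeakerLLM's actual architecture.

    # Two-granularity speaker tokenizer sketch (illustrative, not the paper's code).
    import torch
    import torch.nn as nn

    class HierarchicalSpeakerTokenizer(nn.Module):
        def __init__(self, feat_dim=80, token_dim=512):
            super().__init__()
            self.frame_proj = nn.Linear(feat_dim, token_dim)   # frame-level tokens
            self.utt_pool = nn.AdaptiveAvgPool1d(1)            # utterance-level summary

        def forward(self, feats):                               # feats: (B, T, feat_dim)
            frame_tokens = self.frame_proj(feats)               # (B, T, token_dim)
            utt_token = self.utt_pool(frame_tokens.transpose(1, 2)).squeeze(-1)  # (B, token_dim)
            # The utterance token summarizes identity/profile cues, while the
            # frame tokens keep fine-grained acoustic descriptors for reasoning.
            return utt_token, frame_tokens

    utt, frames = HierarchicalSpeakerTokenizer()(torch.randn(2, 300, 80))
    print(utt.shape, frames.shape)  # torch.Size([2, 512]) torch.Size([2, 300, 512])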
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST), University of Seoul
The main contribution of this paper is the introduction of SpeakerLLM, a speaker-specialized audio-LLM framework that effectively integrates speaker understanding and verification reasoning within a natural-language interface. This work significantly advances the field of audio processing by enhancing the explainability and accuracy of speaker verification systems, making it a valuable addition to the literature.
The paper presents a well-structured methodology with a clear two-stage training process for SpeakerLLM, which effectively integrates speaker profiling, recording condition understanding, and verification reasoning. The hierarchical speaker tokenizer is a novel approach that captures different granularities of speaker evidence, enhancing the model's ability to process and understand speaker-specific cues. The decision-composition policy that separates profile-level evidence from the final decision is a significant advancement in explainability for speaker verification systems.
The experiments are comprehensive, demonstrating the effectiveness of SpeakerLLM-Base and SpeakerLLM-VR through various tasks, including speaker profiling and verification reasoning. The results show substantial improvements over general audio-LLMs, especially in tasks requiring fine-grained acoustic evidence. The use of a controlled dataset and clear evaluation metrics strengthens the findings.
The authors commit to releasing the metadata-enriched supervision dataset and target-construction code, which is crucial for reproducibility. However, the paper could benefit from additional details on the implementation of the models and the specific configurations used during training.
The paper acknowledges limitations, including the need for further evaluation of the model in real-world noisy environments and the necessity of consent-aware interfaces for user privacy. The reliance on specific datasets may limit the generalizability of the findings.
The proposed framework has significant implications for the development of audio-first AI systems, particularly in enhancing user interaction through personalized and context-aware speaker verification. The ability to provide explainable decisions in speaker verification can improve trust and usability in applications like conversational agents and security systems.
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.
Primary: University of New Brunswick
All Institutions: University of New Brunswick, FPT Software, University of Science
The paper presents a contestable multi-agent framework for multimedia verification that integrates multimodal large language models and an arena-based argumentation approach. The methodology is innovative and addresses critical issues in multimedia verification, although empirical validation and detailed experimental results are needed to fully assess its impact.
The proposed methodology is innovative, integrating multimodal large language models with an arena-based quantitative bipolar argumentation framework. The multi-agent approach effectively decomposes multimedia verification tasks into claim-centered sections, allowing for structured argumentation and transparent reasoning. The use of selective clash resolution and uncertainty-aware escalation enhances the system's robustness and practicality for real-world applications.
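The paper's A-QBAF semantics is not reproduced in this summary; the sketch below uses a standard DF-QuAD-style gradual semantics to show, purely for illustration, how aggregated support and attack strengths might be combined into a claim-level strength on a small acyclic argument graph. The arena mechanism, clash resolution, and escalation logic are not modeled here.

    # Gradual strength evaluation on a small bipolar argument graph
    # (DF-QuAD-style combination, used here only as an illustrative stand-in).

    def aggregate(strengths):
        # Probabilistic-sum aggregation of attacker or supporter strengths.
        acc = 0.0
        for s in strengths:
            acc = acc + s - acc * s
        return acc

    def combined_strength(base, attackers, supporters, strength):
        va = aggregate([strength(a) for a in attackers])
        vs = aggregate([strength(s) for s in supporters])
        if va >= vs:
            return base * (1.0 - (va - vs))
        return base + (1.0 - base) * (vs - va)

    # Claim with base score 0.5, one attack (strength 0.7) and two supports
    # (0.6, 0.4): the final strength would feed the section-wise verdict.
    print(combined_strength(0.5, [0.7], [0.6, 0.4], strength=lambda x: x))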
The paper lacks detailed experimental results or benchmarks that validate the proposed framework's effectiveness. While it describes the methodology in depth, the absence of empirical data or comparisons against existing methods limits the assessment of its performance and impact.
The implementation is publicly available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient details on the datasets used, evaluation metrics, or specific experimental setups, which could hinder full reproducibility.
The paper does not address potential limitations in terms of scalability, the complexity of the argumentation process, or the handling of ambiguous cases. Additionally, the reliance on external verification tools may introduce variability in results based on the quality of those tools.
The framework has significant implications for multimedia verification, particularly in combating misinformation in digital media. Its emphasis on contestability and transparency could enhance trust in automated verification systems, making it a valuable tool for journalists, fact-checkers, and the general public.
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
Primary: Sharif University of Technology
All Institutions: Sharif University of Technology, Independent Researcher
This paper presents the first large-scale dataset of Persian music and successfully adapts a state-of-the-art generative model to this culturally rich domain. The comprehensive methodology and promising results underscore the potential for AI to engage with and celebrate diverse musical traditions.
The methodology is robust, featuring a comprehensive dataset curation process that addresses the significant gap in Persian music resources. The authors employed a sophisticated approach for audio segmentation, tagging, and conditioning using state-of-the-art models. The three-stage training pipeline for adapting MusicGen to Persian music is well-structured, emphasizing unsupervised domain adaptation, instrument-focused fine-tuning, and supervised fine-tuning, which collectively enhance the model's cultural fidelity and stylistic accuracy. However, the reliance on automated tagging and the absence of expert validation for some aspects of the dataset may introduce noise and inaccuracies.
The experimental evaluation is thorough, utilizing both objective metrics (KLD and Chroma Cosine Similarity) and a hybrid evaluation strategy. The results indicate that the fine-tuned model significantly outperforms the baseline in generating culturally coherent Persian music. However, the evaluation could benefit from a more extensive subjective assessment involving trained musicians to capture perceptual qualities that are critical in music generation.
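As an illustration of the chroma cosine similarity metric, a sketch using standard librosa chroma features is shown below; the exact configuration (sample rate, frame rate, averaging) is an assumption and may differ from the paper's evaluation setup.

    # Sketch of a chroma cosine similarity check between generated and reference
    # audio; configuration values are illustrative assumptions.
    import numpy as np
    import librosa

    def chroma_cosine_similarity(path_gen, path_ref, sr=32000):
        y_gen, _ = librosa.load(path_gen, sr=sr)
        y_ref, _ = librosa.load(path_ref, sr=sr)
        c_gen = librosa.feature.chroma_stft(y=y_gen, sr=sr).mean(axis=1)  # 12-dim
        c_ref = librosa.feature.chroma_stft(y=y_ref, sr=sr).mean(axis=1)
        return float(np.dot(c_gen, c_ref) /
                     (np.linalg.norm(c_gen) * np.linalg.norm(c_ref) + 1e-8))

    # Usage: chroma_cosine_similarity("generated.wav", "reference.wav")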
The paper provides a clear description of the dataset creation process and model training, which facilitates reproducibility. However, some details regarding the specific configurations used during training and the exact nature of the evaluation metrics could be elaborated upon to enhance clarity for future researchers attempting to replicate the study.
Key limitations include the dataset's skewed genre distribution towards Persian pop, which may affect the model's generalizability across other Persian music styles. The automatic tagging process may introduce inaccuracies, and the evaluation metrics used do not fully capture the richness of Persian music, particularly in terms of microtonal fidelity and ornamentation. Additionally, the model's performance may be constrained by the smaller variant of MusicGen used for fine-tuning.
This research has significant implications for the field of generative music, particularly in promoting cultural diversity in AI-generated content. By addressing the underrepresentation of Persian music in generative models, this work opens avenues for further exploration of other non-Western musical traditions. The dataset created can serve as a valuable resource for future research in music generation, potentially influencing the development of more culturally-aware AI systems.
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, the scarcity of paired speech and transcription data often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features, ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.
Primary: Kyoto University
All Institutions: Kyoto University, LY Corporation
The main contribution of this paper is the introduction of the TE2SL framework, which enhances text-only domain adaptation in LLM-based ASR by generating expressive pseudo-audio prompts through a learnable refinement module. This work represents a significant advancement in bridging the modality gap in ASR systems, with promising implications for improving performance in data-scarce environments.
The proposed Text-Embedding-to-Speech-Latent (TE2SL) framework innovatively addresses the challenge of text-only domain adaptation in LLM-based ASR by introducing a learnable refinement module that enhances the quality of pseudo-audio prompts. This method effectively bridges the modality gap by ensuring that the synthesized prompts are both sample-dependent and aligned with the characteristics of the audio encoder and projector. The methodology is well-structured, with a clear distinction between training and adaptation phases, and utilizes a Conformer architecture to achieve this refinement. The focus on architecture-aware synthesis is a significant advancement over previous heuristic approaches.
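The refinement module can be pictured with a hypothetical PyTorch sketch in which a generic TransformerEncoder stands in for the paper's Conformer-based refiner; all dimensions, names, and the interface are illustrative assumptions, not the TE2SL implementation.

    # Hypothetical refinement of text embeddings into pseudo-audio prompts
    # (a plain TransformerEncoder is used here in place of the Conformer).
    import torch
    import torch.nn as nn

    class PseudoAudioPromptRefiner(nn.Module):
        def __init__(self, text_dim=768, audio_dim=1024, n_layers=2):
            super().__init__()
            self.in_proj = nn.Linear(text_dim, audio_dim)
            layer = nn.TransformerEncoderLayer(d_model=audio_dim, nhead=8, batch_first=True)
            self.refiner = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, text_emb):                     # (B, L, text_dim)
            # Refine token-wise text embeddings into prompts shaped like the
            # projector's audio latents, so they can be prepended to the LLM input.
            return self.refiner(self.in_proj(text_emb))  # (B, L, audio_dim)

    prompts = PseudoAudioPromptRefiner()(torch.randn(2, 16, 768))
    print(prompts.shape)  # torch.Size([2, 16, 1024])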
The experiments conducted are thorough, comparing the TE2SL framework against established baselines, including LLM-only fine-tuning and pseudo-audio prompt methods. The results demonstrate substantial improvements in both recognition accuracy and out-of-vocabulary (OOV) recall across multiple datasets in English and Japanese, validating the effectiveness of the proposed method. The use of diverse datasets strengthens the generalizability of the findings, and the metrics employed (WER and CER) are appropriate for evaluating ASR performance.
The paper provides a detailed description of the experimental setup, including model architectures, training configurations, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Clearer documentation or a supplementary material section with implementation details could enhance reproducibility.
One limitation is the reliance on the quality of the audio encoder and projector, which may vary across different languages or domains. Additionally, while the method shows promise in improving OOV recall, the paper does not extensively discuss the implications of these improvements in practical applications. The scalability of the TE2SL framework in low-resource settings, where high-quality audio encoders may not be available, also warrants further exploration.
The proposed approach has significant potential applications in various domains where ASR systems are deployed, particularly in low-resource languages or specialized fields with limited paired data. By improving domain adaptation capabilities, this work can enhance accessibility and usability of ASR technologies in diverse linguistic contexts. The findings could also inform future research on multimodal learning and integration of audio-visual data in ASR systems.
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
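GCC-PHAT itself is a standard spatial cue; a conventional single-pair implementation is sketched below for reference. IsoNet feeds such cues into the network as features rather than using the final delay estimate directly, so the sample rate and usage here are illustrative.

    # Standard GCC-PHAT for one microphone pair (assuming 16 kHz audio).
    import numpy as np

    def gcc_phat(sig, ref, fs=16000, max_tau=None):
        """Return the PHAT-weighted cross-correlation and the estimated delay."""
        n = sig.shape[0] + ref.shape[0]
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        R = SIG * np.conj(REF)
        cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)
        max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        tau = (np.argmax(np.abs(cc)) - max_shift) / float(fs)
        return cc, tau

    rng = np.random.default_rng(0)
    x = rng.normal(size=16000)
    cc, tau = gcc_phat(np.roll(x, 8), x)
    print(tau)  # ~ 8 / 16000 = 0.0005 s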
Primary: Institute of Engineering, Tribhuvan University
All Institutions: Institute of Engineering, Tribhuvan University
IsoNet presents a novel approach to audio-visual target speech extraction, effectively addressing the limitations of compact microphone arrays in challenging acoustic environments. The combination of advanced methodologies and thorough experimental validation positions this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology in IsoNet is robust, combining multi-channel STFT features, GCC-PHAT spatial cues, and face-conditioned visual embeddings within a U-Net architecture. The use of curriculum learning to progressively introduce SNR challenges is a thoughtful approach that enhances model robustness. The architecture is designed to address specific failure modes of compact microphone arrays, making it relevant for practical applications. The integration of auxiliary direction-of-arrival supervision is a notable addition that helps regularize the learning process.
The experiments are well-structured, utilizing a large dataset of 25,000 simulated mixtures from VoxCeleb, which is appropriate for the task. The evaluation metrics (SI-SDR, PESQ, and STOI) provide a comprehensive view of both objective and perceptual quality. The results demonstrate significant improvements over baseline methods, particularly in challenging SNR conditions. The ablation studies effectively isolate the contributions of different components of the model, providing clear insights into the efficacy of visual and spatial conditioning.
The paper provides sufficient detail on the experimental setup, including the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the lack of publicly available code or datasets limits the ability for independent verification of results.
The study primarily focuses on scenarios with a single interfering speaker, which may not fully capture the complexities of real-world environments with multiple speakers and background noise. Additionally, the reliance on simulated data may introduce discrepancies when transitioning to real-world applications. The phase reconstruction method used could also be improved for better performance in low SNR conditions.
The proposed IsoNet system has significant implications for various applications, including voice assistants, hearing aids, and augmented reality devices, where selective listening is crucial. By enhancing the ability to extract target speech in complex acoustic environments, this research could improve user experiences in everyday communication scenarios.
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing ``Break-the-Beat!,'' a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The main contribution of this paper is the introduction of "Break-the-Beat!", a novel model for controllable MIDI-to-drum audio synthesis that combines advanced conditioning mechanisms with a pre-trained audio generation framework. This work not only fills a crucial gap in the existing literature but also offers practical tools for music producers, enhancing the creative process in digital music production.
The methodology presented in the paper is robust and innovative, leveraging a pre-trained text-to-audio model (SAO) and introducing a dual-input content encoder that effectively combines MIDI and reference audio for drum synthesis. The hybrid conditioning mechanism is a noteworthy contribution, allowing for precise control over both rhythm and timbre. The use of a novel dataset constructed from existing drum audio datasets is a significant step towards addressing the lack of resources in this area. The authors provide a clear overview of their approach, detailing the input representations, conditioning mechanisms, and training strategies, which enhances the clarity and reproducibility of their work.
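One plausible way to obtain a high-resolution rhythmic conditioning signal from a drum MIDI file is sketched below using pretty_midi; the frame rate and onset-grid representation are assumptions for illustration, not the paper's actual content-encoder input format.

    # Turn a drum MIDI file into a high-temporal-resolution onset grid.
    import numpy as np
    import pretty_midi

    def drum_onset_roll(midi_path, fs=100):
        pm = pretty_midi.PrettyMIDI(midi_path)
        n_frames = int(np.ceil(pm.get_end_time() * fs)) + 1
        roll = np.zeros((128, n_frames), dtype=np.float32)
        for inst in pm.instruments:
            if not inst.is_drum:
                continue
            for note in inst.notes:
                roll[note.pitch, int(round(note.start * fs))] = note.velocity / 127.0
        return roll   # (128 pitches, frames at 10 ms resolution)

    # Usage: roll = drum_onset_roll("loop.mid")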
The experimental evaluation is thorough, utilizing a well-defined dataset and a variety of metrics to assess the performance of the proposed model. The results demonstrate significant improvements in audio quality, rhythmic alignment, and beat continuity, particularly when using higher temporal resolutions for MIDI input. The paper effectively compares its method against various baselines and provides qualitative and quantitative analyses, which strengthen the validity of the findings. However, the paper could benefit from additional user studies or subjective evaluations to further substantiate the claims of improved audio quality.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository limits the ability of other researchers to fully replicate the study. Providing access to the trained models or code would significantly enhance the reproducibility of the results.
One limitation of the study is the reliance on a specific dataset, which may not encompass the full diversity of drum sounds and styles encountered in real-world music production. Additionally, while the model performs well on the evaluated metrics, the subjective quality of generated audio in practical scenarios remains to be fully explored. The paper also does not address potential computational costs associated with training and inference, which could be a barrier for some users.
The proposed model has the potential to significantly impact digital music production by providing a tool that allows for greater control and creativity in drum synthesis. This could democratize music production for non-experts and enhance the workflow of professional producers. Furthermore, the findings could inspire future research in the area of symbolic-to-audio synthesis, particularly for other instrument types and musical styles.
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
Primary: Radboud University
All Institutions: Radboud University, Radboud University Medical Center
This paper presents the first benchmark for speech-based EarlyPD detection, addressing a critical gap in the literature. The comprehensive methodology and robust experimental evaluation provide a significant contribution to the field, encouraging further research and development in clinically meaningful detection methods.
The paper introduces a well-structured benchmark for Early-stage Parkinson's Disease (EarlyPD) detection from speech, addressing the critical issue of comparability in existing research. The methodology includes a speaker-independent split for datasets, a clear definition of EarlyPD, and a multi-dimensional evaluation framework that allows for nuanced comparisons across various factors, such as gender and disease stage. The use of diverse training-resource settings and the inclusion of both public and private datasets enhance the robustness of the proposed benchmark.
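A speaker-independent split can be realized with standard tooling; the sketch below uses scikit-learn's GroupShuffleSplit grouped by speaker ID, whereas the benchmark's actual split protocol, ratios, and datasets are fixed by the released resources rather than by this code.

    # Speaker-independent split: all utterances from one speaker stay on one side.
    from sklearn.model_selection import GroupShuffleSplit

    def speaker_independent_split(utterance_ids, labels, speaker_ids, test_size=0.3, seed=0):
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        train_idx, test_idx = next(splitter.split(utterance_ids, labels, groups=speaker_ids))
        # Sanity check: no speaker appears in both partitions.
        assert not set(speaker_ids[i] for i in train_idx) & set(speaker_ids[i] for i in test_idx)
        return train_idx, test_idx

    tr, te = speaker_independent_split(list(range(8)),
                                       [0, 0, 1, 1, 0, 1, 0, 1],
                                       ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"])
    print(tr, te)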
The experiments are comprehensive, utilizing multiple speech tasks and a variety of machine learning models. The results are presented clearly, with a focus on both aggregate and utterance-level performance. The findings indicate significant improvements in EarlyPD detection when expanding speaker diversity, which is a valuable insight for future research. The evaluation metrics used (AUC and F1) are appropriate for the clinical context, ensuring relevance to real-world applications.
The authors emphasize transparency and reproducibility by providing all necessary resources and protocols for replicating their benchmark. The fixed training and evaluation settings, along with the release of datasets, contribute to a high level of reproducibility. However, the reliance on specific datasets may limit generalizability if future datasets differ significantly.
One limitation is the potential bias introduced by the datasets used, particularly in terms of gender representation and the skewed nature of some datasets. Additionally, while the benchmark is robust, the focus on specific speech tasks may not encompass the full range of speech variability seen in real-world clinical settings. The authors also note that spontaneous speech tasks were not included, which could be a significant aspect of EarlyPD detection.
The proposed benchmark has the potential to significantly advance the field of speech-based EarlyPD detection, promoting more reliable and clinically relevant research. By establishing a standardized evaluation protocol, it encourages the adoption of best practices in the community, ultimately leading to improved diagnostic tools for Parkinson's disease. The emphasis on explainability in model design also aligns with current trends in AI, making the findings particularly relevant for future developments in healthcare technology.
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
Primary: ServiceNow
All Institutions: ServiceNow
The main contribution of this paper is the introduction of EVA-Bench, a novel evaluation framework for voice agents that combines realistic simulation with comprehensive metrics to assess performance across various architectures and conditions. This work significantly advances the field by addressing critical evaluation challenges and providing a foundation for future research in voice agent technology.
The methodology presented in this paper is robust and addresses significant gaps in the evaluation of voice agents. The authors introduce an end-to-end framework, EVA-Bench, that combines realistic bot-to-bot audio simulation with comprehensive measurement metrics (EVA-A and EVA-X). The simulation methodology is particularly noteworthy as it incorporates automated validation to ensure the quality of user simulations, which is critical for obtaining reliable evaluation scores. The introduction of controlled perturbations to assess robustness against accent and noise variations further strengthens the methodology, allowing for a nuanced understanding of system performance across different conditions.
The experimental evaluation is thorough, involving 12 systems across three distinct architectures and a total of 213 scenarios. The results reveal critical insights into the performance of voice agents, particularly the divergence between peak and reliable performance, which is a crucial finding for real-world applications. The use of multiple trials and the introduction of pass@1, pass@k, and pass^k metrics provide a comprehensive view of system capabilities, although the paper could benefit from additional comparative analysis against existing benchmarks to contextualize the findings further.
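The distinction between pass@1, pass@k, and pass^k can be made concrete with a small sketch over repeated trials of one scenario; EVA-Bench's exact estimators and the aggregation across scenarios may differ from this simplified boolean form.

    # Peak vs. reliable capability over k repeated runs of the same scenario.

    def pass_metrics(trials):
        """trials: list of booleans, one per repeated run of the same scenario."""
        return {
            "pass@1": float(trials[0]),    # single attempt
            "pass@k": float(any(trials)),  # peak: at least one success in k runs
            "pass^k": float(all(trials)),  # reliable: all k runs succeed
        }

    # A scenario that succeeds in 2 of 3 runs: high peak, zero reliability.
    print(pass_metrics([True, False, True]))
    # {'pass@1': 1.0, 'pass@k': 1.0, 'pass^k': 0.0}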
The authors emphasize reproducibility by providing open-source access to the framework, evaluation suite, and benchmark data. They include detailed implementation instructions and configurations, which are essential for other researchers to replicate the study. However, the reliance on commercial model APIs for full reproduction may limit accessibility for some researchers, potentially impacting the overall reproducibility of the findings.
The paper acknowledges several limitations, including potential biases in the LLM-based judges, the lack of multilingual coverage, and the constraints of the user simulator in replicating real human caller behaviors. Additionally, the evaluation does not account for harmful outputs or sensitive information exposure, which is particularly relevant in high-stakes domains. The authors also note that the framework does not assess more complex agent configurations, which may limit its applicability in certain scenarios.
The EVA-Bench framework has significant implications for the development and evaluation of voice agents in enterprise applications. By providing a comprehensive evaluation methodology, it can help improve the reliability and user experience of voice agents, ultimately leading to better deployment in real-world settings. The findings regarding performance gaps and robustness under perturbations can inform future research and development efforts, guiding improvements in voice agent architectures and evaluation practices.
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly-labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable fraction of multi-source samples where background interference or overlapping events could limit the usefulness of the data. To address this challenge, we introduce a data curation framework designed for large-scale open audio corpora. Our approach leverages a generative diffusion model to synthesize clean single-class events to construct controlled noisy mixtures for supervision. We subsequently employ a pre-trained audio encoder coupled with a discriminative classifier to automatically identify and filter out multi-source samples. Experiments show that our framework achieves strong performance on a human expert-curated test set. Finally, we release FSD50K-Solo, a model-curated subset of FSD50K containing single-source audio samples identified by our method. Beyond FSD50K, our method establishes a scalable paradigm for curating open source audio corpora.
Primary: Stony Brook University
All Institutions: Stony Brook University, Bose Corporation
This paper introduces a systematic framework for automated curation of single-source sound events, addressing critical data quality challenges in audio machine learning. The innovative use of generative models for dataset enhancement and the strong experimental results position this work as a significant contribution to the field.
The proposed methodology employs a generative diffusion model to synthesize clean single-source audio events, which is a novel approach to address the challenge of multi-source interference in existing datasets. The framework's reliance on a pre-trained audio encoder and a discriminative classifier for filtering multi-source samples is a significant advancement in automated data curation. The systematic approach to generating controlled noisy mixtures for supervision demonstrates a thoughtful integration of generative modeling with traditional classification techniques.
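Constructing controlled noisy mixtures for supervision typically amounts to scaling a background to a target signal-to-noise ratio before adding it to the clean event; a generic sketch of that step (not the paper's exact mixing protocol) is shown below.

    # Mix a clean single-class event with background at a target SNR.
    import numpy as np

    def mix_at_snr(clean_event, background, snr_db):
        n = min(len(clean_event), len(background))
        s, b = clean_event[:n], background[:n]
        p_s = np.mean(s ** 2) + 1e-12
        p_b = np.mean(b ** 2) + 1e-12
        gain = np.sqrt(p_s / (p_b * 10 ** (snr_db / 10.0)))  # scale the background
        return s + gain * b

    rng = np.random.default_rng(0)
    mix = mix_at_snr(rng.normal(size=16000), rng.normal(size=16000), snr_db=5.0)
    print(mix.shape)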
The experiments are well-structured, utilizing both generated data and a human-curated internal dataset for evaluation. The performance metrics, including traditional classification metrics and Audiobox Aesthetics scores, provide a robust assessment of the model's effectiveness. The results indicate strong classification performance, particularly on the expert-curated dataset, which underscores the model's practical applicability.
The paper states that the complete clip-level metadata of FSD50K-Solo will be released, supporting reproducibility. However, the lack of a direct link to the dataset or code repository limits immediate access for other researchers. The methodology is described in sufficient detail to allow for replication, but the absence of a public project URL is a drawback.
One limitation acknowledged is the potential domain gap between generated data and real-world audio data, which could affect generalization. Additionally, while the framework shows promise, its performance on unseen event classes has yet to be explored. The reliance on a human-curated dataset for validation may introduce biases inherent in the curation process.
The release of FSD50K-Solo and the proposed curation framework have the potential to significantly advance audio machine learning research by providing a high-quality dataset that can enhance model training and evaluation. The methodology can be applied to other audio corpora, promoting better practices in dataset curation across the field. The implications of improved audio datasets extend to various applications, including sound event detection, audio synthesis, and machine learning in general.
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-cognition ALM processing only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
Primary: Ghent University
All Institutions: Ghent University, Vrije Universiteit Brussel, Queen Mary University of London
The main contribution of this paper is the introduction of NAACA, a training-free neuro-inspired architecture that employs oscillatory dynamics for salience-driven attention gating in audio processing. This innovative approach addresses critical limitations in existing audio language models, offering a promising direction for future research and applications in audio understanding.
The methodology presented in NAACA is innovative, leveraging a neuro-inspired Oscillatory Working Memory (OWM) to address the attention bottleneck in Audio Language Models (ALMs). The approach of framing salience detection as an auditory filtering problem is well-grounded in cognitive neuroscience, and the training-free aspect of the architecture is particularly noteworthy. The use of oscillatory dynamics to maintain stable memory states while adapting to salient changes in audio streams is a significant advancement over traditional methods that rely on extensive historical data or training. The detailed formulation of OWM and its integration into the NAACA framework are technically sound, although the complexity of the model may pose challenges for practical implementation.
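For intuition, a gate of the kind described above can be sketched as a generic adaptive-energy detector that escalates a frame to the ALM only when its energy departs from a slowly tracked background. This is a simplification for illustration; it does not reproduce the paper's oscillatory dynamics, and all parameters are assumptions.

```python
import numpy as np

def salience_gate(frames, alpha=0.95, k=3.0, warmup=20):
    """Yield indices of frames whose short-term energy deviates from a slowly
    adapting baseline by more than k running standard deviations. Frames seen
    during the warm-up period only update the baseline."""
    mean = var = 0.0
    for i, frame in enumerate(frames):
        e = float(np.mean(np.asarray(frame, dtype=float) ** 2))
        if i >= warmup:
            std = max(np.sqrt(var), 1e-6)
            if abs(e - mean) > k * std:
                yield i  # salient: escalate this frame to the ALM
        # exponential tracking of the ambient background
        mean = alpha * mean + (1 - alpha) * e
        var = alpha * var + (1 - alpha) * (e - mean) ** 2
```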
The experiments conducted on the XD-Violence and Urban Soundscapes of the World (USoW) datasets provide robust evidence of NAACA's effectiveness. The reported improvement in average precision (AP) demonstrates a clear performance gain over existing models, and the qualitative case studies further illustrate the model's ability to detect salient events in complex audio environments. However, the paper could benefit from a more comprehensive comparison with a wider range of baseline models to fully contextualize its performance.
The paper provides a thorough description of the methods and implementation details, which enhances reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the findings. Including a demo or project URL would greatly enhance the paper's impact and usability within the community.
The primary limitation noted is the dependency on the performance of the chosen audio encoder, which may restrict the model's applicability to out-of-distribution sound events. Additionally, the hard-gating mechanism may overlook contextual information that could be preserved with more flexible attention mechanisms. The evaluation metrics focus mainly on anomaly detection, suggesting that future work should explore broader audio understanding tasks.
The implications of this research are significant, particularly in fields such as public safety surveillance, environmental monitoring, and any domain where audio analysis is critical. By improving the efficiency and effectiveness of audio processing in real-time applications, NAACA has the potential to enhance situational awareness and response capabilities in various contexts.
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA regression and a symbolic rendering baseline, while direct regression remains stronger on phase-sensitive waveform L1. Auxiliary RVQ cross-entropy improves short-step diffusion on mel error, onset-flux cosine, and waveform L1, with the most favorable trade-offs occurring at 6-25 denoising steps depending on the metric.
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, Athena RC
This paper contributes a significant advancement in symbolic-to-audio drum rendering through a novel latent-diffusion model that preserves event timing and dynamics while synthesizing realistic audio. The comprehensive methodology and robust experimental evaluation position it as a meaningful contribution to the field of machine learning in audio applications.
The paper presents a novel approach to symbolic-to-audio drum rendering using a conditional latent-diffusion model, which aligns symbolic conditioning to physical time and utilizes PCA for dimensionality reduction in the latent space. The methodology is well-structured, incorporating auxiliary RVQ cross-entropy for improved performance and demonstrating a clear pipeline from symbolic input to audio output. The use of PCA coordinates as a denoising target rather than direct waveform samples is innovative and addresses the challenges of maintaining control over the generated audio while ensuring acoustic fidelity.
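To make the denoising-target construction concrete, the PCA step can be sketched as below, assuming access to a matrix of frame-level summed-codebook DAC latents; the exact variance threshold, standardization details, and function names are assumptions rather than the paper's implementation.

```python
import numpy as np

def fit_pca(latents, var_threshold=0.99):
    """latents: (num_frames, 1024) summed-codebook DAC latents.
    Returns mean, component matrix, and per-dim scales for a standardized target."""
    mu = latents.mean(axis=0)
    U, S, Vt = np.linalg.svd(latents - mu, full_matrices=False)
    ratio = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(ratio, var_threshold) + 1)  # e.g. ~72 in the paper's setup
    comps = Vt[:k]                                       # (k, 1024)
    coords = (latents - mu) @ comps.T
    scale = coords.std(axis=0) + 1e-8
    return mu, comps, scale

def to_target(latents, mu, comps, scale):
    return ((latents - mu) @ comps.T) / scale            # standardized diffusion target

def from_target(coords, mu, comps, scale):
    return (coords * scale) @ comps + mu                 # deterministic path back to 1024-d
```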
The experimental setup is robust, utilizing a substantial dataset of 11,523 training examples and a variety of evaluation metrics that capture different aspects of audio quality, including spectral fidelity and transient accuracy. The results indicate significant improvements over baseline methods, particularly in spectral and transient metrics, although direct regression outperforms on phase-sensitive waveform metrics. The comprehensive evaluation across multiple configurations and the use of statistical testing to validate findings enhance the credibility of the results.
The paper outlines the training and evaluation processes in detail, including hyperparameters and data preprocessing steps, which supports reproducibility. However, the lack of a public repository at the time of review limits immediate reproducibility. The authors mention plans to release the code, which would further aid in this aspect.
The study is narrow in scope, focusing on short four-beat segments rather than full musical compositions, which may limit the generalizability of the findings. Additionally, the reliance on automatic evaluation metrics without a human listening study raises questions about perceived audio quality. The fixed PCA representation may not be optimal for all contexts, and the evaluation does not account for sampling variability.
The proposed method has significant implications for music technology, particularly in enhancing the controllability and fidelity of drum synthesis in various applications, including music production and interactive audio systems. The approach could inspire further research into symbolic-to-audio translation methods and their integration into broader music generation frameworks.
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
Primary: Adalat AI, India
All Institutions: Adalat AI, India
The paper presents Vividh-ASR, a complexity-tiered benchmark and a novel training strategy (R-MFT) that significantly enhances the performance of ASR systems for low-resource Indic languages. This work is a valuable contribution to the field, addressing critical challenges in adapting multilingual ASR models while preserving their foundational acoustic capabilities.
The paper introduces a systematic factorial design to dissect the effects of learning rate timing and curriculum ordering on ASR performance. The proposed Reverse Multi-Stage Fine-Tuning (R-MFT) method is well-structured, allowing for a clear understanding of how different training strategies impact model adaptation. The complexity-tiered benchmark, Vividh-ASR, is a significant methodological contribution, providing a structured way to evaluate ASR models across varying levels of acoustic complexity.
The experiments are rigorous, employing a controlled factorial design that isolates key variables affecting performance. The results are clearly presented, demonstrating substantial improvements in WER through the proposed methods. The analysis of internal model representations using CKA and SVD adds depth to the evaluation, linking empirical results to theoretical insights about model adaptation.
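For readers unfamiliar with the representational analysis, linear CKA between two layers' activations can be computed with the standard formulation below; this is the textbook definition, not necessarily the paper's exact implementation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)
```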
The paper provides sufficient details on the implementation, including model architectures, training stages, and hyperparameters, which facilitates reproducibility. However, the lack of a publicly available demo or project URL limits the ease of access to the exact experimental setup.
The study primarily focuses on Hindi and Malayalam, which may limit the generalizability of the findings to other Indic languages or low-resource languages in general. Additionally, while the paper discusses the preservation of the encoder's acoustic geometry, it does not fully explore the implications of this for other model architectures or training paradigms.
The findings have significant implications for improving ASR systems in low-resource languages, potentially enhancing accessibility and usability in diverse linguistic contexts. The introduction of a complexity-tiered benchmark could inspire further research and development in ASR, particularly for languages that have been historically underrepresented in machine learning research.
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).
Primary: unknown
All Institutions: unknown
Text2Score presents a novel two-stage framework for generating sheet music from natural language prompts, significantly advancing the state of symbolic music generation. The methodology effectively separates planning and execution, yielding high-quality outputs that outperform existing models, while the comprehensive evaluation framework sets a new standard for assessing music generation systems.
The methodology presented in Text2Score is innovative, utilizing a two-stage framework that separates the planning and execution phases of music generation. This approach allows for more structured reasoning about musical attributes, which is a significant advancement over traditional end-to-end models. The use of an LLM orchestrator to create a structured measure-wise plan is particularly noteworthy, as it addresses issues related to the lack of aligned text-music datasets. The integration of a generative model that processes this plan through a hierarchical decoder further enhances the robustness of the generation process. The detailed definition of the structural plan and the metrics for evaluation are well-articulated, providing a clear framework for assessing the generated outputs.
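A hypothetical measure-wise plan of the kind the orchestrator might emit, together with a trivial serialization into conditioning text, is sketched below; the field names and serialization format are illustrative and do not reflect the paper's actual schema.

```python
# Illustrative measure-wise plan; all field names are hypothetical.
plan = {
    "key": "G major",
    "time_signature": "3/4",
    "instruments": ["Violin", "Piano"],
    "measures": [
        {"index": 1, "harmony": "G", "texture": "melody over block chords"},
        {"index": 2, "harmony": "D7", "texture": "arpeggiated accompaniment"},
    ],
}

def plan_to_prompt(plan: dict) -> str:
    """Serialize the plan into conditioning text for the execution model."""
    lines = [f"K:{plan['key']}", f"M:{plan['time_signature']}"]
    for m in plan["measures"]:
        lines.append(f"% measure {m['index']}: {m['harmony']} | {m['texture']}")
    return "\n".join(lines)
```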
The experimental evaluation is thorough, employing both objective metrics and subjective assessments from expert musicians. The paper provides a comprehensive suite of evaluation metrics that cover playability, readability, and prompt adherence, which are crucial for assessing the quality of generated sheet music. The results demonstrate that Text2Score outperforms several baseline models, indicating the effectiveness of the proposed framework. However, the paper could benefit from a more detailed discussion of the dataset's diversity and the specific prompts used in evaluations to better contextualize the results.
The paper includes sufficient details regarding the implementation, including the architecture of the models and the training procedures. The use of ModernBERT and a hierarchical decoder is clearly described, and the authors have made their dataset and code available, which supports reproducibility. However, the lack of specific details about the dataset curation process and the exact nature of the prompts used in evaluations could hinder full reproducibility.
One limitation noted is the potential for the LLM-generated inference plan to diverge from training plans, which could lead to discrepancies in output quality. Additionally, while the evaluation metrics are comprehensive, they may not capture all aspects of musical quality, particularly in terms of expressive nuances that could be important for professional compositions. The paper also acknowledges the need for richer annotations to capture finer musical details, which could enhance the model's performance.
The implications of this work are significant for the fields of music generation and artificial intelligence. By providing a framework that can generate high-quality sheet music from textual prompts, Text2Score opens new avenues for composers and musicians, potentially streamlining the creative process. The open-sourcing of the dataset and code encourages further research and development in this area, promoting collaboration and innovation. The integration of LLMs in music generation also highlights the potential for AI to assist in creative fields, which could lead to broader applications in music education and composition.
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high-confidence pseudo-labels outperforms indiscriminate large-scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
Primary: Durham University
All Institutions: Durham University, IEEE Publication Technology Group
The paper presents a novel semi-supervised framework for speech confidence detection that integrates deep learning with interpretable acoustic features, significantly advancing the field of affective computing. The methodology is innovative, addressing key challenges in data scarcity and subjective annotation, while the experimental results demonstrate strong performance and robustness.
The paper introduces a semi-supervised hybrid framework for speech confidence detection that effectively combines deep semantic embeddings from the Whisper model with interpretable acoustic features. The methodology is robust, employing an Uncertainty-Aware Pseudo-Labelling strategy that prioritizes high-quality pseudo-labels over indiscriminate augmentation, which is a significant advancement in dealing with limited labelled data. The integration of eGeMAPS descriptors and auxiliary features for vocal stress and disfluency detection enhances the model's ability to capture nuanced acoustic signals associated with speaker confidence. The late fusion strategy used to combine different modalities is well-justified and effectively addresses the limitations of relying solely on deep semantic representations.
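The uncertainty-aware pseudo-labelling step can be illustrated with a minimal confidence-threshold filter over the model's softmax outputs; the paper's actual retention criterion may combine additional uncertainty signals, and the threshold below is an assumption.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """probs: (n_unlabelled, n_classes) softmax outputs of the current model.
    Keep only samples whose top-class probability exceeds the threshold."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= threshold
    return np.flatnonzero(keep), labels[keep]
```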
The experimental evaluation is thorough, employing a well-structured 5-fold cross-validation approach to ensure the reliability of results. The paper reports a Macro-F1 score of 0.751, which is a notable improvement over various self-supervised baselines, demonstrating the effectiveness of the proposed method. The ablation studies provide insights into the importance of data quality and the contributions of different components of the model, reinforcing the claims made about the hybrid architecture's advantages. However, the paper could benefit from additional qualitative evaluations or user studies to further validate the model's performance in real-world scenarios.
The paper provides a detailed description of the methodology, including dataset creation, feature extraction, and model architecture, which aids in reproducibility. However, the absence of publicly available datasets and code repositories limits the ability of other researchers to replicate the study fully. The authors should consider releasing their code and datasets to enhance transparency and facilitate further research.
The primary limitation of the study is the reliance on a relatively small ground truth dataset (N=600), which may affect the generalizability of the findings. Additionally, the model's performance may be constrained by the subjective nature of confidence detection and the cultural variability in expressing confidence. The focus on short audio clips also limits the model's ability to capture context and fluctuations in confidence over longer interactions.
The proposed framework has significant implications for applications in affective computing, particularly in enhancing human-computer interaction and adaptive learning environments. By enabling machines to detect speaker confidence, the framework could improve the responsiveness of virtual assistants and educational tools, fostering more engaging and supportive user experiences. Furthermore, the insights gained from this research could contribute to mental health monitoring and interventions by identifying low confidence as a precursor to anxiety.
Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Tsinghua University
AuDirector presents a significant advancement in immersive audio storytelling through a self-reflective closed-loop multi-agent framework. Its innovative approach to integrating character profiling, emotional instruction, and user interaction sets a new standard in the field of audio generation, addressing key limitations of existing systems while demonstrating substantial technical impact and potential for broad applications.
The methodology of AuDirector is innovative, integrating a multi-agent framework that combines identity-aware pre-production, collaborative synthesis and correction, and human-guided interactive refinement. Each component is well-defined, with a clear flow from narrative input to audio output. The use of a closed-loop self-correction mechanism is particularly noteworthy, as it addresses common pitfalls in generative audio systems, such as quality inconsistency and lack of user control. The framework's reliance on large language models (LLMs) for character profiling and emotional instruction generation is a strong point, allowing for nuanced audio generation that aligns closely with narrative context.
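The closed-loop self-correction can be sketched as an audit-and-regenerate loop over script segments; `synthesize` and `audit` stand in for the synthesis and quality-check agents and are placeholders, not names from the paper.

```python
def synthesize_with_correction(segments, synthesize, audit, max_rounds=3):
    """Render each script segment, then re-render any segment the auditor flags,
    up to a fixed number of correction rounds."""
    audio = [synthesize(seg) for seg in segments]
    for _ in range(max_rounds):
        flagged = [i for i, a in enumerate(audio) if not audit(segments[i], a)]
        if not flagged:
            break
        for i in flagged:
            audio[i] = synthesize(segments[i])  # regenerate only the defective parts
    return audio
```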
The experiments conducted are robust, comparing AuDirector against state-of-the-art baselines like WavJourney and PodAgent across both objective and subjective metrics. The use of diverse datasets, including podcasts and radio dramas, enhances the evaluation's credibility. The results indicate significant improvements in structural coherence, emotional expressiveness, and acoustic fidelity, validating the proposed framework's effectiveness. The ablation study further strengthens the findings by demonstrating the importance of the self-correction mechanism.
The paper provides a comprehensive overview of the implementation details, including the specific agents used and the evaluation protocols. However, the lack of a publicly accessible code repository limits reproducibility. Future work could benefit from releasing the code and detailed instructions to facilitate further research and validation by the community.
While AuDirector shows promise, it still faces challenges in generating non-speech audio tracks, particularly regarding acoustic diversity and nuance. This limitation could impact the overall immersion of the audio narratives. Additionally, the complexity of the system may pose challenges for users unfamiliar with audio production, potentially limiting its accessibility.
The potential applications of AuDirector are vast, ranging from enhancing audio storytelling in entertainment to educational tools that require immersive audio experiences. The framework could significantly impact industries such as gaming, film, and interactive media, where high-quality audio narratives are essential. Furthermore, the integration of user feedback into the generation process could lead to more personalized and engaging content.
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.
Primary: StepFun
All Institutions: StepFun, Imperial College London, Peking University, Shanghai Jiao Tong University, The University of New South Wales
The main contribution of this paper is the introduction of OmniClean, a visually debiased evaluation framework for omni-modal language models, and the demonstration of its effectiveness through a comprehensive staged post-training approach. This work significantly enhances the interpretability of model performance and sets a new standard for evaluating multi-modal capabilities in machine learning.
The paper introduces a novel evaluation framework, OmniClean, which effectively filters out visually solvable queries to assess the true omni-modal capabilities of language models. This methodology is significant as it addresses the prevalent issue of visual shortcuts in existing benchmarks, providing a more accurate measure of model performance across audio-visual-language tasks. The staged post-training approach, OmniBoost, combines mixed bi-modal supervised fine-tuning, mixed-modality reinforcement learning, and self-distillation, showcasing a comprehensive strategy for enhancing model performance.
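The visual-only probing that defines OmniClean can be illustrated as below, assuming each query carries its video, question, and reference answer; the number of probing trials and the consistency rule are assumptions rather than the audit protocol reported in the paper.

```python
def audit_queries(queries, visual_only_model, n_trials=3):
    """Mark a query as visually solvable if a vision-language model answers it
    correctly from the visual stream alone; such queries are removed."""
    retained = []
    for q in queries:
        correct = sum(
            visual_only_model(q["video"], q["question"]) == q["answer"]
            for _ in range(n_trials)
        )
        if correct < n_trials:  # not consistently solvable without audio
            retained.append(q)
    return retained
```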
The experiments are well-structured, utilizing a large dataset of 16,968 queries, from which 8,551 were retained after visual-only probing. The results demonstrate that the proposed methods yield meaningful improvements in model performance, particularly in the context of the cleaned evaluation view. The comparison against existing models, including larger counterparts, highlights the effectiveness of the proposed training stages.
The paper provides detailed descriptions of the experimental setup, including data sources, training protocols, and evaluation metrics. However, the lack of access to the actual model weights and code may hinder full reproducibility. The authors do release the OmniClean dataset, which aids in replicating the evaluation process.
One limitation is the reliance on the specific model architecture (Qwen2.5-Omni-3B) for the experiments, which may not generalize to other architectures or larger models. Additionally, while the visually debiased evaluation is a significant improvement, it does not eliminate all forms of bias in the benchmarks. The paper also acknowledges that the self-distillation results are profile-dependent, indicating variability in effectiveness across different datasets.
The findings have substantial implications for the development of omni-modal language models, as they provide a clearer understanding of model capabilities and limitations. By addressing visual leakage in evaluations, the work encourages the design of more robust benchmarks that can better assess true multi-modal integration. This could lead to advancements in applications requiring comprehensive understanding across audio, visual, and textual modalities.
We propose the Chunkwise Aligner, a novel architecture for streaming automatic speech recognition (ASR). While the Transducer is the standard model for streaming ASR, its training is costly due to the need to compute all possible audio-label alignments. The recently introduced Aligner reduces this cost by discarding explicit alignments, but this modification makes it unsuitable for streaming. Our approach overcomes this limitation by dividing the audio into chunks and aligning each label to the leftmost frames of its chunk, whereas transitions between chunks are managed by a learned end-of-chunk probability. Experiments show that the Chunkwise Aligner not only matches the Transducer's accuracy in both offline and streaming scenarios, but also offers superior training and decoding efficiencies.
Primary: University of Electro-Communications
All Institutions: University of Electro-Communications, NTT, Inc.
The main contribution of this paper is the introduction of the Chunkwise Aligner, which enhances streaming ASR by effectively managing audio segmentation and alignment, resulting in improved efficiency and comparable accuracy to existing models. This work represents a meaningful advancement in the field of speech recognition, particularly for applications requiring real-time processing.
The proposed Chunkwise Aligner introduces a novel architecture that effectively addresses the limitations of existing models in streaming automatic speech recognition (ASR). By segmenting audio into chunks and utilizing a learned end-of-chunk probability for transitions, the methodology enhances both training efficiency and decoding speed while maintaining accuracy comparable to the Transducer model. The architecture's reliance on self-transduction within chunks is innovative and provides a clear advantage in streaming scenarios, showcasing a thoughtful adaptation of existing techniques.
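One plausible reading of the chunkwise alignment is sketched below: each chunk's labels occupy its leftmost frames, the remaining frames are blanks, and the final position is reserved for the end-of-chunk decision. This is an interpretation for illustration only; the paper's actual target construction and its handling of the learned end-of-chunk probability may differ.

```python
def chunk_targets(labels_per_chunk, chunk_len, blank=0, eoc=-1):
    """Build per-frame targets for each chunk: labels on the leftmost frames,
    blanks elsewhere, and a reserved end-of-chunk slot at the final position."""
    targets = []
    for labels in labels_per_chunk:
        assert len(labels) < chunk_len, "chunk must have room for its labels"
        frame_targets = list(labels) + [blank] * (chunk_len - len(labels) - 1) + [eoc]
        targets.append(frame_targets)
    return targets
```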
The experiments conducted on well-known datasets, LibriSpeech and CSJ, provide a robust evaluation of the Chunkwise Aligner's performance. The reported results demonstrate that the model achieves competitive word error rates (WER) and character error rates (CER) while significantly improving decoding speed over the Transducer. The inclusion of various configurations and the analysis of alignment strategies add depth to the evaluation, supporting the claims made regarding the model's efficiency and effectiveness.
The paper outlines the system configuration, including model architecture, training parameters, and data preprocessing steps, which contributes to reproducibility. However, the lack of publicly available code or a project URL limits the ability for independent verification of results. Providing access to the implementation would enhance the reproducibility of the findings.
One notable limitation is the dependency on forced alignments for training, which may affect performance when the timing of ground truth labels does not align with the model's predictions. Additionally, the paper acknowledges potential degradation in performance with varying alignment strategies, indicating that further exploration is needed to improve robustness in diverse scenarios.
The Chunkwise Aligner has significant implications for real-time speech recognition applications, particularly in environments where low latency is critical. Its ability to maintain accuracy while reducing computational costs makes it suitable for deployment in various devices and applications, potentially enhancing user experiences in voice-activated systems and automated transcription services.
Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to suppress interfering information in the CQT representation, and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings.
Primary: Beijing University of Civil Engineering and Architecture
All Institutions: Beijing University of Civil Engineering and Architecture, Lyra Lab, Tencent Music Entertainment, Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture
The main contribution of this paper is the introduction of Poly-SVC, a novel singing voice conversion system that effectively handles residual harmonies in polyphonic scenarios, significantly improving the quality of voice conversion outputs. The technical contributions, particularly in pitch extraction and model architecture, represent a meaningful advancement in the field of audio processing and machine learning.
The proposed Poly-SVC framework introduces a novel approach to singing voice conversion that addresses the challenges posed by residual harmonies in accompanied recordings. The use of a Constant-Q Transform (CQT) for pitch extraction is innovative, as it allows for the preservation of both lead melodies and residual harmonies, which is crucial for high-fidelity audio synthesis. The integration of a random sampler to mitigate interference and a Conditional Flow Matching (CFM)-based diffusion decoder further enhances the model's robustness. The methodology is well-structured, with clear delineation of components and their functions, although it could benefit from more detailed explanations of the CFM loss and its implications.
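The CQT front end can be approximated with librosa as below; the transform parameters and log scaling are illustrative choices rather than the paper's configuration, and the random sampler is omitted.

```python
import librosa
import numpy as np

def cqt_pitch_features(path, n_bins=84, bins_per_octave=12, hop_length=512):
    """Log-magnitude CQT that keeps both the lead melody and residual harmony,
    in contrast to a single-F0 extractor."""
    y, sr = librosa.load(path, sr=None, mono=True)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           n_bins=n_bins, bins_per_octave=bins_per_octave))
    return librosa.amplitude_to_db(C, ref=np.max)  # (n_bins, n_frames)
```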
The experiments are comprehensive, utilizing a diverse set of datasets that include both speech and singing data across multiple languages. The subjective evaluation framework, employing Mean Opinion Score (MOS) and Similarity-MOS (SIM-MOS), is appropriate for assessing the quality of voice conversion in both single-melody and harmony-rich scenarios. The results demonstrate that Poly-SVC significantly outperforms existing baselines, particularly in preserving harmonic structures, which is a key contribution to the field. However, the paper could enhance its experimental rigor by including objective metrics alongside subjective evaluations.
The paper provides a reasonable level of detail regarding the implementation, including the choice of models and datasets. However, it lacks specific URLs or repositories for code and data, which are essential for reproducibility. Clearer descriptions of hyperparameters and training procedures would also aid in replicating the results.
One limitation is the reliance on subjective evaluations, which can be influenced by individual preferences and biases. Additionally, the model's performance in extremely complex polyphonic scenarios remains to be fully explored. The paper also acknowledges the challenge of content overlapping, which is not fully addressed in the current framework.
The implications of this research extend to various applications in music production, entertainment, and accessibility. By improving the quality of singing voice conversion, Poly-SVC could facilitate personalized music experiences, enhance karaoke applications, and support language learning through singing. The approach may also inspire further research into polyphonic audio processing and machine learning applications in music.
Over the past two decades, the task of musical beat tracking has transitioned from heuristic onset detection algorithms to highly capable deep neural networks (DNNs). Although DNN-based beat tracking models achieve near-perfect performance on mainstream, percussive datasets, the SMC dataset has stubbornly yielded low F-measure scores. By testing how well state-of-the-art models detect beats on individual tracks in the SMC dataset, we identify three distinct failure modes: octave errors, continuity errors, and complete tracking failure where all metrics fall below 0.3. We reveal that state-of-the-art models tend to generate "confident-but-wrong" activations. Furthermore, we show that the standard DBN's default minimum tempo of 55 BPM prevents it from inferring the correct tempo for 21% of SMC tracks, forcing double-tempo predictions on slow music. By exposing such fundamental oversights, we provide concrete directions for improving beat and downbeat detection, specifically emphasizing training data diversification and multi-hypothesis tempo estimation.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
This paper provides a critical analysis of beat tracking failures in state-of-the-art models, identifying specific weaknesses and proposing directions for future research. The combination of detailed diagnostics and practical recommendations positions this work as a valuable contribution to the field of music information retrieval.
The paper presents a thorough diagnostic analysis of beat tracking models on the SMC dataset, identifying specific failure modes that have not been previously documented. The methodology involves a systematic evaluation of three state-of-the-art models, categorizing the difficulties of the dataset into four axes and analyzing the activation functions to pinpoint the causes of errors. This approach is innovative as it combines qualitative analysis with quantitative metrics, providing a nuanced understanding of model performance.
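A simple check for the octave-error failure mode, based on the ratio of median inter-beat intervals between predictions and references, is sketched below; the tolerance and the exact categorization rules used in the paper may differ.

```python
import numpy as np

def tempo_ratio(pred_beats, ref_beats):
    """Median inter-beat-interval ratio between reference and predicted beats."""
    pred_ibi = np.median(np.diff(pred_beats))
    ref_ibi = np.median(np.diff(ref_beats))
    return ref_ibi / pred_ibi

def is_octave_error(pred_beats, ref_beats, tol=0.1):
    """Flag double- or half-tempo tracking, the first failure mode discussed above."""
    r = tempo_ratio(pred_beats, ref_beats)
    return any(abs(r - target) < tol * target for target in (2.0, 0.5))
```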
The experiments are well-structured, employing an 8-fold cross-validation setup to ensure robust evaluation. The use of various metrics, including F-measure and continuity metrics, allows for a comprehensive assessment of model performance. The results reveal significant insights into the limitations of current models, particularly in handling tempo instability and metrical ambiguity, which are critical for advancing the field.
While the paper outlines the experimental setup and evaluation metrics, it lacks detailed information on the implementation of the models and the specific configurations used. This could hinder reproducibility for other researchers attempting to replicate the results or build upon the findings.
One limitation of the study is the reliance on the SMC dataset, which, while challenging, may not fully represent the diversity of musical styles encountered in real-world applications. Additionally, the findings suggest a need for more diverse training data to overcome the identified activation ceiling, indicating that current models may not generalize well to other datasets.
The insights gained from this research have the potential to significantly influence future work in music information retrieval, particularly in improving beat tracking algorithms. By addressing the identified failure modes and suggesting enhancements to model training and architecture, this work could lead to more robust systems that better handle complex musical structures.
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
Primary: Independent Researcher
All Institutions: Independent Researcher
STRUM presents a comprehensive audio-to-chart pipeline for rhythm games, addressing a critical bottleneck in chart creation. The paper's technical contributions, including a detailed methodology and thorough evaluation, position it as a meaningful advancement in the field of automatic music transcription.
The methodology employed in STRUM is robust and multifaceted, utilizing a combination of CRNNs, ensemble classifiers, and various detection techniques tailored to different instruments. The use of a two-stage CRNN for drums, alongside a comprehensive pipeline for other instruments, reflects a thoughtful approach to the complexities of audio transcription. The ablation studies provide valuable insights into the contributions of individual components, enhancing the credibility of the results. However, the reliance on pre-processed audio stems and the lack of a fully end-to-end model may limit the generalizability of the findings.
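The paired per-song significance testing in the ablation can be reproduced in spirit with scipy's Wilcoxon signed-rank test, as sketched below for one ablated component; the significance level is an assumption.

```python
from scipy.stats import wilcoxon

def ablation_significant(f1_full, f1_ablated, alpha=0.05):
    """Paired per-song Wilcoxon signed-rank test between the full pipeline and
    an ablated variant; f1_full and f1_ablated are per-song F1 lists."""
    stat, p = wilcoxon(f1_full, f1_ablated)
    return p < alpha, p
```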
The experimental evaluation is thorough, featuring a well-defined benchmark of 30 songs and clear metrics for performance assessment. The F1 scores reported for different instruments indicate a solid performance, particularly for drums, which is the focus of the paper. The use of a specific audio-quality criterion for song selection is a commendable approach that adds rigor to the evaluation process. However, the limited sample size and the specific genre focus may restrict the applicability of the results across a broader range of music.
The authors have made significant efforts to ensure reproducibility by releasing code, model weights, and a benchmark manifest. This transparency is crucial for the research community, allowing others to validate and build upon the work. The detailed descriptions of the methodologies and the evaluation protocols further enhance the reproducibility of the study.
The paper acknowledges several limitations, including the rejection of songs based on audio quality, which may exclude potentially valuable data. The vocal transcription performance is notably lower than that of the instruments, indicating a misalignment between detected onsets and community charting practices. Additionally, the blue lane's low accuracy highlights challenges in accurately distinguishing between similar sounds in the context of rhythm games.
STRUM has the potential to significantly impact the rhythm game community by streamlining the chart creation process, making it more accessible for newcomers while providing experienced charters with a solid starting point. The open-source nature of the project encourages collaboration and further development, which could lead to advancements in automatic music transcription and related fields.
Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
Primary: College of William & Mary
All Institutions: College of William & Mary, Emory University, George Mason University
This paper introduces AffectCodec, a neural speech codec that prioritizes emotion preservation in speech representations, significantly enhancing the emotional expressiveness of synthesized speech while maintaining high levels of semantic fidelity and prosodic naturalness. The innovative integration of emotion-aware optimization strategies marks a substantial advancement in the field of audio processing and speech synthesis.
The paper presents a novel approach to neural speech coding by explicitly integrating emotion preservation into the codec's optimization objectives. The methodology is well-structured, comprising three key components: Emotion-Semantic Guided Latent Modulation, Relation-Preserving Emotional-Semantic Distillation, and Emotion-Weighted Semantic Alignment. This multi-faceted approach is innovative as it addresses a significant gap in existing codecs that typically overlook emotional expressiveness during quantization. The use of frozen emotion and semantic models to guide the codec's learning process is particularly noteworthy, enhancing the emotional fidelity of the generated speech.
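A generic relation-preserving distillation objective, which matches the pairwise similarity structure of codec latents to that of frozen emotion-encoder embeddings, is sketched below; it illustrates the idea rather than the paper's exact loss or weighting.

```python
import torch
import torch.nn.functional as F

def relation_preserving_loss(student, teacher):
    """Match pairwise cosine-similarity structure between codec latents (student)
    and frozen emotion-encoder embeddings (teacher), both of shape (batch, dim)."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    sim_s = s @ s.T
    sim_t = t @ t.T
    return F.mse_loss(sim_s, sim_t)
```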
The experiments are extensive and rigorously designed, evaluating the proposed codec across multiple benchmarks, including EMO-SUPERB and LibriSpeech, for both speech emotion recognition and text-to-speech synthesis. The results demonstrate significant improvements in emotional consistency, perceptual quality, and content preservation compared to baseline models. The use of both objective metrics (like WER and PESQ) and subjective evaluations (like UTMOS and MUSHRA) provides a comprehensive assessment of the codec's performance.
The paper provides detailed implementation information, including training procedures, model architectures, and evaluation metrics. However, while the methodology is well-documented, the actual code repository or supplementary materials for reproduction are not provided, which could hinder reproducibility for other researchers.
The paper acknowledges the computational efficiency of the codec but suggests that future work could explore lighter-weight architectures. Additionally, while the emotional preservation is significantly improved, the paper does not address potential challenges in real-world applications, such as variability in emotional expression across different speakers or contexts.
The proposed codec has the potential to enhance applications in conversational AI, voice synthesis, and emotional computing, where preserving emotional nuances is critical. By improving the fidelity of emotional expression in synthesized speech, this work could lead to more engaging and human-like interactions in voice-based systems.
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, Athena RC
The paper presents a novel approach to drum synthesis that effectively combines symbolic MIDI representations with neural audio codecs. This work contributes to the field by providing a structured methodology for generating high-quality drum audio, with implications for both music technology and machine learning research.
The paper introduces a novel approach to drum synthesis by leveraging a Transformer-based model to predict neural audio codec tokens from expressive drum grids. This method is innovative as it employs a two-stage process where the first stage involves mapping MIDI-derived grids to token sequences, and the second stage decodes these tokens into audio waveforms. The use of state-of-the-art neural codecs (EnCodec, DAC, and X-Codec) for this task is particularly noteworthy, as it allows for a controlled comparison of different audio representations and their impact on synthesis quality. The methodology is well-structured, with clear definitions of the expressive drum grid representation and the training process.
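As a concrete reading of this design, the minimal sketch below (dimensions, layer counts, and the module name are illustrative assumptions, not taken from the paper) shows how a Transformer encoder might map expressive drum-grid frames to per-codebook codec-token logits; a pretrained codec decoder would then convert the predicted token ids into a waveform.

# Minimal sketch (not the authors' code): a Transformer encoder maps an
# expressive drum grid (onset/velocity/microtiming per drum per step) to
# discrete codec-token logits; a pretrained codec decoder, assumed to exist
# elsewhere, would turn the predicted token ids into audio.
import torch
import torch.nn as nn

class GridToCodecTokens(nn.Module):
    def __init__(self, n_drums=9, grid_feats=3, d_model=256,
                 n_codebooks=4, codebook_size=1024, n_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(n_drums * grid_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # One classification head per codec codebook (residual VQ levels).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)])

    def forward(self, grid):                      # grid: (B, T, n_drums*grid_feats)
        h = self.encoder(self.in_proj(grid))      # (B, T, d_model)
        return torch.stack([head(h) for head in self.heads], dim=1)
        # logits: (B, n_codebooks, T, codebook_size); argmax gives token ids

grid = torch.randn(2, 400, 9 * 3)                 # toy batch of drum-grid frames
logits = GridToCodecTokens()(grid)
print(logits.shape)                               # torch.Size([2, 4, 400, 1024])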
The experiments are comprehensive, utilizing the Expanded Groove MIDI Dataset (E-GMD), which is a substantial dataset with aligned MIDI and audio data. The evaluation metrics are robust, including both token-level and audio-level assessments, which provide a multifaceted view of the model's performance. The results demonstrate that EnCodec consistently outperforms the other codecs in terms of audio quality and token accuracy, indicating the effectiveness of the proposed approach. However, the paper could benefit from additional qualitative evaluations, such as user studies, to complement the objective metrics.
The paper provides detailed information about the methodology, including the architecture of the Transformer model, training procedures, and evaluation metrics. Although no packaged demo or supplementary materials accompany the paper, which may make exact replication harder, the project URL does give access to the code, which is a positive aspect for reproducibility.
One limitation is the lack of subjective evaluations, such as listening tests, which are crucial for assessing the perceptual quality of the generated audio. Additionally, the paper notes that increasing model capacity can lead to instability, which is a significant concern for future work. The reliance on objective metrics alone may not fully capture the nuances of audio quality and musicality.
The proposed system has the potential to significantly impact music production by automating the generation of realistic drum audio from MIDI inputs, thereby facilitating creative processes for musicians and producers. The insights gained from codec comparisons could also inform future developments in audio synthesis and machine learning applications in music.
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without training supervised, task-specific editing models.
Primary: Mila -- Québec AI Institute
All Institutions: Mila -- Québec AI Institute, Université Laval, Concordia University
The paper presents LATTE, a novel latent audio tokenizer that enables controllable audio manipulation through a compact representation of speech. This work significantly advances the field of audio processing by introducing a methodology that balances efficient representation with the ability to manipulate global audio attributes, thus opening new avenues for research and application in speech technology.
The paper introduces LATTE, a novel latent audio tokenizer that replaces traditional frame-level tokenization with a compact set of learnable latent tokens. This approach allows for the aggregation of global information across entire utterances, facilitating targeted interventions in audio attributes such as speaker identity and background noise. The methodology is well-structured, leveraging existing frameworks like FocalCodec while innovatively adapting them to create a more interpretable and controllable representation of audio data. The use of slot importance scoring to analyze and manipulate latent tokens is a significant methodological advancement.
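A minimal sketch of the latent-token bottleneck idea follows, assuming a generic Transformer encoder and toy shapes rather than LATTE's actual FocalCodec-based architecture: learnable tokens are appended to the frame sequence, attend jointly with it, and only their outputs are retained for downstream quantization and decoding.

# Illustrative sketch (assumptions, not the released LATTE code): a fixed set
# of learnable latent tokens is appended to the frame-level audio features;
# after self-attention, only those K token outputs are kept as the compact,
# non-temporally-aligned bottleneck passed on to quantization and decoding.
import torch
import torch.nn as nn

class LatentTokenBottleneck(nn.Module):
    def __init__(self, d_model=512, n_latent=32, n_layers=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, n_latent, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_latent = n_latent

    def forward(self, frame_feats):               # (B, T, d_model) from an audio encoder
        B = frame_feats.size(0)
        lat = self.latents.expand(B, -1, -1)
        x = torch.cat([frame_feats, lat], dim=1)  # frames and latent tokens attend jointly
        x = self.encoder(x)
        return x[:, -self.n_latent:]              # keep only the K latent tokens

feats = torch.randn(2, 300, 512)                  # toy frame-level encoder output
print(LatentTokenBottleneck()(feats).shape)       # torch.Size([2, 32, 512])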
The experiments are comprehensive, evaluating LATTE against several benchmarks (LibriSpeech, VoiceBank, Libri1Mix) for resynthesis quality, and demonstrating competitive performance relative to existing models. The analysis of slot importance across various factors provides strong evidence for the model's effectiveness in manipulating audio attributes. The results are quantitatively supported by multiple metrics (UTMOS, DNSMOS, dWER, speaker similarity), showcasing the model's robustness in different scenarios.
The paper provides detailed descriptions of the architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the reliance on specific pretrained models and datasets may introduce variability in results if replicated with different resources. The authors could improve reproducibility by providing access to their codebase and trained models.
The primary limitation noted is the scale and diversity of the training data, which may restrict the model's generalizability and robustness across varied audio conditions. The authors acknowledge that the current dataset is smaller compared to competitors, which could impact the specialization of latent slots for different audio factors.
The implications of this research are significant, as controllable audio tokenization can enhance applications in speech restoration, privacy-preserving analysis, and generative audio modeling. However, the potential for misuse in voice conversion and identity manipulation raises ethical considerations that must be addressed. The authors emphasize the need for responsible deployment and usage guidelines.
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, rigorous evaluation of video retrieval has become essential. In realistic search scenarios, this requires matching short user queries to long-form content using both visual and auditory evidence. Yet existing retrieval benchmarks are still dominated by short clips, single modalities, and caption-based evaluation. We introduce FLARE, a full-modality long-video audiovisual retrieval benchmark with user-simulated queries. Built from 399 carefully screened Video-MME videos (10--60 min, 225.4 h) to ensure source quality and diversity, FLARE contains 87,697 clips annotated with vision, audio, and unified audiovisual captions, together with 274,933 user-style queries. Cross-modal queries are further filtered by a hard bimodal constraint, requiring retrieval to fail under either modality alone but succeed when both are combined. FLARE evaluates models under two regimes, caption-based and query-based retrieval, across vision, audio, and unified audiovisual settings. Experiments with 15 representative retrievers show that user-style queries substantially change model behavior, strong caption-based performance does not always transfer to query-based retrieval, and audio--language alignment remains a key bottleneck for unified audiovisual retrieval. Our code and data are released at https://flarebench.github.io/
Primary: University of Science and Technology Beijing
All Institutions: University of Science and Technology Beijing, Peking University, Institute of Automation, Chinese Academy of Sciences, Zhongguancun Academy
The paper presents FLARE, a pioneering benchmark for long-video audiovisual retrieval that integrates user-simulated queries and a rigorous bimodal constraint. This work is significant as it addresses critical gaps in existing benchmarks and offers valuable insights into the performance of multimodal retrieval systems, ultimately pushing the boundaries of research in this area.
The paper introduces FLARE, a comprehensive benchmark for long-video audiovisual retrieval that incorporates user-simulated queries and a hard bimodal constraint. The methodology is robust, combining automated processes with human review to ensure high-quality data generation and annotation. The segmentation of videos into coherent clips based on both visual and auditory cues is particularly noteworthy, as is the dual-regime evaluation protocol that isolates the impact of query formulation on model performance.
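The hard bimodal constraint can be read as a simple filter over retrieval outcomes; the sketch below is a schematic interpretation in which retrieve_rank is a hypothetical helper returning the rank of the target clip under a chosen evidence modality, and k is an assumed cutoff.

# Hedged sketch of the hard bimodal constraint: a query is kept only if the
# target clip is missed by vision-only and audio-only evidence but found when
# the unified audiovisual evidence is used. `retrieve_rank` is hypothetical.
def passes_hard_bimodal_constraint(query, target_clip_id, retrieve_rank, k=1):
    rank_vision = retrieve_rank(query, target_clip_id, "vision")
    rank_audio = retrieve_rank(query, target_clip_id, "audio")
    rank_av = retrieve_rank(query, target_clip_id, "av")
    fails_alone = rank_vision > k and rank_audio > k   # each single modality misses
    succeeds_joint = rank_av <= k                       # combined evidence hits
    return fails_alone and succeeds_joint

demo_ranks = {"vision": 7, "audio": 12, "av": 1}        # toy retriever behaviour
print(passes_hard_bimodal_constraint(
    "who scores after the whistle", "clip_042",
    lambda q, c, m: demo_ranks[m]))                     # True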
The experiments are well-structured, evaluating 15 representative models under both caption-based and query-based regimes. The findings reveal significant differences in model performance based on the type of query used, highlighting the limitations of existing models in handling user-style queries. The results are comprehensive, covering various modalities and retrieval directions, and they effectively demonstrate the benchmark's utility in exposing the strengths and weaknesses of current retrieval systems.
The paper provides sufficient details on the benchmark construction, evaluation protocols, and model configurations to allow for reproducibility. The authors have also released their code and data, which enhances the reproducibility of their findings.
The paper acknowledges several limitations, including the potential biases introduced by the automated annotation process and the reliance on a specific set of video sources. The user-simulated queries, while innovative, may not fully capture the diversity of real-world user queries. Additionally, the benchmark may not cover all relevant domains or languages.
The FLARE benchmark has the potential to significantly advance the field of audiovisual retrieval by providing a more realistic evaluation framework that aligns with user behavior. However, the authors also caution against potential misuse of advanced retrieval systems for invasive surveillance or biased content discovery.
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis of the target while preserving non-target stems. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. On this benchmark, Polyphonia achieves a 15.5% increase in target alignment over baselines, while maintaining competitive music fidelity and non-target integrity.
Primary: South China University of Technology
All Institutions: South China University of Technology
The main contribution of this work is the introduction of Polyphonia, a framework that advances controllable music generation by enabling precise intra-stem editing in polyphonic music. This paper significantly enhances the state-of-the-art in audio processing by providing a novel methodology that effectively tackles the challenges of timbre transfer while maintaining the integrity of complex audio mixtures.
The methodology presented in Polyphonia is innovative, addressing the critical challenge of stem-specific timbre transfer in polyphonic music through a novel framework that combines Acoustic-Informed Attention Calibration with a probabilistic acoustic prior. This dual-path mechanism effectively resolves semantic-acoustic misalignment, allowing for precise editing while maintaining the integrity of non-target stems. The use of Ideal Ratio Mask (IRM) as an acoustic prior is particularly noteworthy, as it enhances the model's ability to discern and manipulate specific audio components within complex mixtures. The paper also introduces a comprehensive evaluation framework, PolyEvalPrompts, which is essential for assessing the performance of the proposed method across a wide range of timbre transfer tasks.
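One way to picture the calibration step is as a gating of the semantic cross-attention map by the IRM-style prior; the sketch below is an assumption-laden simplification (the paper's actual blending and normalization may differ), intended only to show how a probabilistic acoustic prior can impose coarse time-frequency boundaries on diffuse semantic attention.

# Rough sketch of acoustic-informed attention calibration: an IRM-style soft
# mask for the target stem (assumed to come from a pretrained separation model)
# gates the semantic cross-attention map so edits stay inside the target's
# time-frequency region. The exact blending rule here is an assumption.
import torch

def calibrate_attention(semantic_attn, irm_prior, floor=1e-3):
    """semantic_attn, irm_prior: (B, F, T) maps in [0, 1] over the spectrogram."""
    gated = semantic_attn * irm_prior.clamp(min=floor)      # suppress non-target regions
    return gated / gated.amax(dim=(-2, -1), keepdim=True)   # rescale back to [0, 1]

attn = torch.rand(1, 128, 256)      # toy semantic attention over mel bins x frames
irm = torch.rand(1, 128, 256)       # toy IRM prior for the target stem
print(calibrate_attention(attn, irm).shape)   # torch.Size([1, 128, 256])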
The experiments conducted are robust, utilizing well-established datasets (MUSDB18-HQ and MusicDelta) and a diverse set of state-of-the-art baselines for comparison. The results demonstrate significant improvements in target alignment and non-target integrity, with quantitative metrics such as CLAP, LPAPS, and CQT1-PCC providing a thorough assessment of the model's performance. The inclusion of subjective evaluations through Mean Opinion Scores (MOS) adds depth to the experimental analysis, confirming the effectiveness of the proposed method in real-world applications.
The paper provides detailed implementation information, including the architecture used (AudioLDM 2) and the specifics of the acoustic prior extraction process. However, the absence of a public code repository limits full reproducibility. While the methodology is well-documented, access to the actual implementation would enhance the ability of other researchers to replicate the results.
One limitation noted is the reliance on pre-trained Blind Source Separation (BSS) models, which may introduce biases based on the training data used, potentially affecting the generalizability of the approach to non-Western music or less common instruments. Additionally, the method's performance may degrade in extremely dense mixtures, although it shows graceful degradation rather than catastrophic failure.
The potential applications of Polyphonia are significant, particularly in the realm of music production and editing, where precise control over individual audio stems is crucial. However, ethical considerations regarding intellectual property and the potential for misuse in unauthorized remixes must be addressed. The authors advocate for the development of provenance tracking and audio watermarking technologies to safeguard artistic integrity.
Reconstructing a 3D sound field from sparse microphone measurements is a fundamental yet ill-posed problem, which we address through Acoustic Transfer Function (ATF) magnitude estimation. ATF magnitude encapsulates key perceptual and acoustic properties of a physical space with applications in room characterization and correction. Although recent generative paradigms such as Flow Matching (FM) have achieved state-of-the-art performance in speech and music generation, their potential in spatial audio remains underexplored. We propose SF-Flow, a novel framework that casts 3D ATF magnitude reconstruction as a guided generation task, with a 3D U-Net conditioned by a permutation-invariant set encoder. This architecture enables reconstruction from an arbitrary number of sparse inputs while leveraging the stable and efficient training properties of FM. Experimental results demonstrate that SF-Flow achieves accurate reconstruction up to 1 kHz, trains substantially faster than the autoencoder baseline, and improves significantly with dataset size.
Primary: National Institute of Informatics
All Institutions: King's College London, National Institute of Informatics, National Institute of Advanced Industrial Science and Technology
The main contribution of this paper is the introduction of SF-Flow, a novel framework for 3D Acoustic Transfer Function magnitude estimation that leverages Flow Matching for efficient and accurate reconstruction from sparse microphone measurements. This work significantly advances the field of spatial audio by providing a new methodology that outperforms traditional approaches while maintaining computational efficiency.
The proposed SF-Flow method introduces a novel approach to sound field magnitude estimation by framing it as a guided generative task using Flow Matching (FM). The architecture employs a 3D U-Net conditioned by a permutation-invariant set encoder, which allows for reconstruction from an arbitrary number of sparse inputs. This methodology is innovative as it leverages the advantages of FM, such as simulation-free training and stable dynamics, to tackle the challenges of sparse measurements in spatial audio. The use of a permutation-invariant encoder is particularly noteworthy, as it addresses the unordered nature of the input data effectively.
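A minimal DeepSets-style sketch of such a permutation-invariant set encoder is shown below; the input layout (3D microphone positions concatenated with per-frequency ATF magnitudes) and all dimensions are assumptions for illustration, not the authors' configuration.

# Minimal sketch of a permutation-invariant set encoder: each sparse
# observation is embedded independently and the embeddings are mean-pooled,
# so the conditioning vector is invariant to observation order and count.
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    def __init__(self, pos_dim=3, n_freq=64, d_cond=256):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(pos_dim + n_freq, d_cond), nn.ReLU(),
            nn.Linear(d_cond, d_cond))
        self.rho = nn.Sequential(nn.ReLU(), nn.Linear(d_cond, d_cond))

    def forward(self, positions, magnitudes):     # (B, N, 3), (B, N, n_freq)
        x = torch.cat([positions, magnitudes], dim=-1)
        return self.rho(self.phi(x).mean(dim=1))  # pool over the N observations

enc = SetEncoder()
cond = enc(torch.rand(2, 5, 3), torch.rand(2, 5, 64))   # 5 sparse microphones
print(cond.shape)                                        # torch.Size([2, 256])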
The experimental setup is robust, utilizing simulated Room Impulse Responses (RIRs) to evaluate the performance of SF-Flow against established baselines such as autoencoders and kernel ridge regression. The results demonstrate that SF-Flow achieves lower Log-Spectral Distortion (LSD) than the autoencoder up to 468 Hz, while maintaining faster training times. The experiments also explore the impact of dataset size on performance, showing that larger datasets significantly improve model accuracy. The comprehensive evaluation across different frequencies and observation counts provides strong evidence of the method's effectiveness.
The paper provides clear details regarding the training procedure, dataset generation, and evaluation metrics, which enhances reproducibility. However, the lack of a direct link to the experimental code or a demo could hinder full reproducibility for other researchers. The authors do mention that the source code and dataset are available, which is a positive aspect.
One limitation of the study is that the results are primarily based on simulated data, which may not fully capture the complexities of real-world environments. Additionally, while the method shows strong performance in magnitude estimation, the authors acknowledge the need for future work to jointly model magnitude and phase, which is crucial for applications like immersive audio rendering.
The implications of this research are significant for various applications in spatial audio, including room acoustics analysis, immersive audio rendering in AR/VR, and audio correction systems. By improving the accuracy and efficiency of sound field reconstruction from sparse measurements, this work could enhance the quality of audio experiences in both consumer and professional settings.
Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) and a Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, XLN Audio
The paper presents a novel cold diffusion framework for percussive dereverberation, addressing a critical gap in audio enhancement research. Its innovative methodology, rigorous experimental evaluation, and practical implications for the music industry underscore its significance in advancing the field of audio processing.
The proposed cold diffusion framework is a novel approach to dereverberation specifically tailored for percussive audio, which has been largely overlooked in previous research focused on speech. The methodology effectively models reverberation as a deterministic degradation process and introduces two reverse-process parameterizations that leverage both UNet and diffusion Transformer architectures. The use of a structured forward process and the careful design of training objectives tailored to the unique characteristics of percussive signals demonstrate a thoughtful and innovative approach to the problem.
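The deterministic degradation view and the two parameterizations can be illustrated schematically as follows; the linear blending schedule, the normalization, and the choice of the clean endpoint as the Direct target are assumptions made for the sketch rather than details taken from the paper.

# Schematic sketch of the cold-diffusion setup: reverberation is treated as a
# deterministic degradation that blends the anechoic stem toward its
# reverberant version, and the two reverse parameterizations differ in what
# the network is asked to predict at severity t. Scheduling here is assumed.
import torch

def degrade(dry, wet, t):
    """Deterministic forward process: t=0 -> anechoic, t=1 -> fully reverberant."""
    return (1.0 - t) * dry + t * wet

def training_targets(dry, wet, t):
    x_t = degrade(dry, wet, t)
    direct_target = dry                           # "Direct": predict the clean(er) state
    delta_target = wet - dry                      # "Delta": predict a normalized residual
    delta_target = delta_target / (delta_target.abs().max() + 1e-8)
    return x_t, direct_target, delta_target

dry = torch.randn(1, 2, 48000)                    # toy 1 s stereo drum stem
wet = dry + 0.3 * torch.randn_like(dry)           # stand-in for a reverberant copy
x_t, d_tgt, r_tgt = training_targets(dry, wet, torch.tensor(0.6))
print(x_t.shape, d_tgt.shape, r_tgt.shape)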
The experiments are comprehensive, utilizing a well-curated dataset that combines both acoustic and electronic drum recordings. The authors provide extensive evaluations on both in-domain and out-of-domain test sets, showcasing the robustness of their method against strong baselines. The results indicate significant improvements across various signal-based and perceptual metrics, particularly in transient preservation and reduction of late reverberation, highlighting the effectiveness of the proposed approach.
The paper includes sufficient implementation details, including the architecture of the models, training configurations, and the datasets used. The availability of code and audio examples on GitHub further enhances reproducibility, allowing other researchers to validate and build upon this work.
While the results are promising, the paper acknowledges that the proposed models still face challenges under strong domain shifts and may not generalize well to all types of reverberation, particularly production-style effects. Additionally, the reliance on a curated dataset may limit the generalizability of the findings.
This research has significant implications for music production and audio engineering, providing tools that can enhance the quality of percussive recordings. The focus on percussive dereverberation opens up new avenues for research and application in audio enhancement, potentially benefiting musicians, producers, and audio engineers.
Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.
Primary: Wroclaw University of Science and Technology
All Institutions: Wroclaw University of Science and Technology, Resemble AI, IDEAS Research Institute, Jagiellonian University
The main contribution of this paper is the introduction of APEX, a post-hoc prototype-based interpretability framework for audio classifiers that preserves model performance while providing intuitive, example-based explanations. This work significantly advances the field of explainable AI in audio processing, addressing critical gaps in existing methodologies and offering a robust framework for future research in audio interpretability.
The proposed APEX framework introduces a novel approach to audio classification interpretability by leveraging prototype-based reasoning without requiring fine-tuning of the original model. The methodology is well-structured, incorporating a Disentanglement Module that effectively separates latent features into interpretable components. The four distinct prototype extraction schemes (Square-based, Time-based, Frequency-based, and Time-Frequency-based) are innovative and tailored to the unique characteristics of audio data, addressing the limitations of existing methods that often apply visual techniques to audio spectrograms. The requirement for output invariance while optimizing the feature space is a significant contribution, ensuring that the interpretability does not compromise the model's performance.
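To make the four perspectives concrete, the toy sketch below pools a classifier's latent spectrogram features over a square patch, a single frequency band, a single time step, and a single time-frequency cell; this is only an illustrative reading, since APEX's actual disentanglement module and prototype matching are considerably richer.

# Hedged illustration (shapes assumed): each "prototype view" here is just an
# average-pooled region of the classifier's latent spectrogram features.
import torch

def prototype_views(feats, f0=10, t0=20, size=4):
    """feats: (C, F, T) latent features from a pretrained audio classifier."""
    square = feats[:, f0:f0 + size, t0:t0 + size].mean(dim=(1, 2))  # local transient event
    freq_band = feats[:, f0, :].mean(dim=1)       # one spectral band across all time
    time_step = feats[:, :, t0].mean(dim=1)       # one time step across all bands
    tf_cell = feats[:, f0, t0]                    # joint time-frequency anchor
    return square, freq_band, time_step, tf_cell

views = prototype_views(torch.randn(256, 64, 128))
print([v.shape for v in views])                   # four (256,) prototype vectors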
The experiments conducted on the WaveFake dataset for audio deepfake detection and the BirdSet dataset for bioacoustic classification demonstrate the effectiveness of the APEX framework. The results show that APEX maintains the classification performance of the underlying model while providing meaningful and localized explanations. The ablation studies, particularly the targeted masking experiments, validate the importance of the highlighted regions, confirming that the method accurately captures the critical acoustic features used by the classifier.
The paper provides sufficient details regarding the implementation of the APEX framework, including the architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing code and datasets to facilitate further research and validation.
A notable limitation of the APEX framework is its applicability to architectures that follow a specific structure, which may restrict its use in more diverse model types. Additionally, while the method shows promise in interpretability, the evaluation metrics primarily focus on classification performance, which may not fully capture the nuances of interpretability in audio contexts.
The APEX framework has significant implications for the deployment of audio classifiers in safety-critical applications, where interpretability is crucial for ethical and legal compliance. By providing clear and semantically meaningful explanations, APEX can enhance trust in audio processing systems, particularly in fields such as healthcare and security.
In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face several pain points: high technical barriers to simulation, offline computation that cannot support real-time interaction, and uncontrollable mapping rules in general-purpose sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.
Primary: Shenyang Conservatory of Music
All Institutions: Shenyang Conservatory of Music
The main contribution of this paper is the development of ChladniSonify, a real-time visual-acoustic mapping system that effectively bridges the gap between visual patterns and sound generation, leveraging advanced machine learning techniques and classical physics principles. The technical contributions are significant, addressing existing challenges in the field of audio-visual mapping and providing a foundation for future artistic and technological innovations.
The methodology is robust, leveraging classical physics principles (Kirchhoff-Love theory) to create a paired dataset for Chladni patterns and vibration frequencies. The use of a lightweight CNN with a CBAM attention mechanism is innovative and tailored for the specific task of recognizing slender nodal lines. The end-to-end system design, integrating Python and Max/MSP for real-time audio-visual mapping, is well thought out and addresses existing gaps in the field.
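The recognition-to-sonification step can be illustrated with a small lookup-and-synthesize sketch; the mode-to-frequency table below uses placeholder values rather than the paper's calibrated frequencies, and the real system streams the tone to Max/MSP instead of returning an array.

# Minimal sketch of the recognition-to-sonification step: the classifier's
# predicted mode index is looked up in a table of theoretical plate
# frequencies and rendered as a sine tone. Table values are placeholders.
import numpy as np

MODE_TO_FREQ_HZ = {0: 180.0, 1: 262.0, 2: 349.0, 3: 523.0}   # illustrative only

def sonify(mode_index, duration_s=1.0, sr=44100, amp=0.3):
    f = MODE_TO_FREQ_HZ[mode_index]
    t = np.arange(int(duration_s * sr)) / sr
    return amp * np.sin(2 * np.pi * f * t)

tone = sonify(2)          # mode index as predicted by the CNN+CBAM classifier
print(tone.shape, tone.dtype)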
The experiments are comprehensive, demonstrating high accuracy (99.33%) and low latency (7.03 ms) for pattern recognition, with a full-link latency under 50 ms. The results validate the proposed mapping system's effectiveness, showing a complete match with theoretical values. However, the experiments primarily rely on synthetic data, which may limit the generalizability of the findings.
The paper provides sufficient detail regarding the dataset construction, model architecture, and experimental setup, which should allow for reproducibility. However, the lack of a publicly available dataset or code repository hinders full reproducibility.
The system is currently limited to specific Chladni patterns (square plates with center excitation) and only includes 15 modes. The reliance on synthetic data and the absence of real-world testing may affect the robustness of the findings. Additionally, the system lacks advanced music creation functionalities, limiting its use for non-technical artists.
This work has significant implications for new media art, enabling artists to create interactive installations that link visual patterns to sound in real-time. It opens avenues for further exploration in audio-visual art and could inspire future research in multimodal systems.
The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a 2x acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.
Primary: Descript
All Institutions: Descript
The main contribution of this paper is the introduction of PoDAR, a framework that enhances the modelability of audio latent spaces through power disentanglement, leading to faster convergence and improved generative performance. This work represents a meaningful advancement in the field of audio generative modeling, addressing critical challenges in latent space representation and providing a pathway for future research in multimodal audio systems.
The methodology presented in PoDAR leverages a randomized power augmentation and a latent consistency objective to disentangle signal power from semantic content within the latent space of audio representations. This approach is innovative as it addresses the challenge of latent modelability, which is often overlooked in favor of reconstruction fidelity. By explicitly separating power and semantic content, the authors provide a structured framework that enhances the efficiency of generative models. The use of partial Classifier-Free Guidance (CFG) to selectively apply guidance only to the power-invariant channels further demonstrates a thoughtful approach to improving model robustness.
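A rough sketch of the augmentation-plus-consistency idea is given below; the channel split, gain range, and loss form are assumptions chosen for illustration, with a stand-in convolutional encoder in place of the Stable Audio VAE.

# Loose sketch: the same utterance is encoded at a random power scaling, and a
# consistency loss ties the content channels of the two latents together so
# that loudness information is pushed into the dedicated power channels.
# Channel split, gain range, and loss weighting are assumptions.
import torch
import torch.nn.functional as F

def power_consistency_loss(encode, wav, n_power_channels=1):
    gain = torch.empty(wav.size(0), 1, 1).uniform_(0.25, 4.0)   # random power scaling
    z = encode(wav)                      # (B, C, T') latent from the VAE encoder
    z_aug = encode(gain * wav)
    content = z[:, n_power_channels:]    # channels meant to be power-invariant
    content_aug = z_aug[:, n_power_channels:]
    return F.mse_loss(content_aug, content.detach())

toy_encoder = torch.nn.Conv1d(1, 8, kernel_size=16, stride=8)   # stand-in encoder
wav = torch.randn(4, 1, 16000)
print(power_consistency_loss(toy_encoder, wav))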
The experimental evaluation is comprehensive, utilizing well-established benchmarks such as the LibriSpeech-PC dataset. The authors report significant improvements in convergence speed and speaker similarity metrics, demonstrating the effectiveness of their proposed framework. The results are quantitatively supported by metrics like UTMOS and speaker similarity, which are critical in the audio domain. However, the paper could benefit from additional qualitative assessments or user studies to complement the quantitative findings.
The paper provides detailed descriptions of the experimental setup, including the architecture of the autoencoders, training configurations, and the metrics used for evaluation. However, the absence of publicly available code or a demo limits the reproducibility of the results. Providing a GitHub repository or similar resource would greatly enhance the ability of other researchers to replicate the findings.
The primary limitation noted in the paper is the increased computational overhead associated with the dual encoder passes required for the consistency objective. Additionally, the framework has only been tested within the speech domain, which may limit its generalizability to other audio modalities or applications. The authors also acknowledge that the disentanglement focuses solely on power, leaving other potential nuisance variables unaddressed.
The implications of this research are significant, particularly in enhancing the efficiency and quality of audio generation systems. By improving the modelability of latent representations, PoDAR could facilitate advancements in various applications, including text-to-speech synthesis, audio restoration, and music generation. The increased efficiency also suggests a potential reduction in the carbon footprint associated with training large generative models, which is an important consideration in the current landscape of machine learning research.
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
Primary: Tianjin University
All Institutions: Tianjin University, Nanyang Technological University, Shanghai Jiao Tong University, Kuaishou Technology, TeleAI, China Telecom, National University of Singapore, Tencent
This paper introduces a novel framework for evaluating expressive appropriateness in speech that integrates contextual understanding and advanced modeling techniques. The comprehensive approach, robust experimental validation, and potential for broad applications mark a significant contribution to the field of machine learning and audio processing.
The paper presents a comprehensive methodology for evaluating expressive appropriateness in speech, which is a significant advancement over traditional methods that focus primarily on emotional intensity or isolated utterance characteristics. The introduction of the CEAEval framework, along with the CEAEval-D dataset and CEAEval-M model, showcases a well-structured approach that integrates knowledge distillation, adaptive audio attention bias, and reinforcement learning. The separation of the expressive planner and scoring model is particularly innovative, allowing for better handling of long-range contextual information. The use of a multi-dimensional annotation system for expressive attributes is a strong methodological contribution, ensuring a nuanced evaluation of speech expressiveness.
The experimental evaluation is robust, with a clear comparison against existing benchmarks. The authors provide detailed results demonstrating that CEAEval-M significantly outperforms other models across various context sizes. The use of human-annotated data enhances the credibility of the findings, and the reported metrics (LCC and ACC) effectively convey the model's performance. The ablation studies further strengthen the evaluation by isolating the impact of different components of the proposed framework.
The paper outlines a clear methodology for data collection and model training, which supports reproducibility. However, the lack of publicly available raw audio data limits the ability of other researchers to fully replicate the study. The authors do commit to releasing model checkpoints and parameters, which is a positive step towards enhancing reproducibility.
The study is primarily focused on Mandarin speech, which may limit its applicability to other languages and cultural contexts. The authors acknowledge this limitation and express intentions to expand the framework to additional languages in future work. Additionally, the subjective nature of expressive appropriateness could introduce variability in human annotations, although the high inter-annotator agreement suggests reliability.
The proposed framework has significant implications for various applications, including conversational agents, audiobook generation, and interactive storytelling systems. By providing a systematic way to evaluate expressive speech, this research could enhance user experiences in narrative-driven applications and contribute to the development of more emotionally aware AI systems.
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Independent Researcher
The paper presents a novel approach to improving zero-shot text-to-speech systems through innovative scheduling and correction techniques. The technical contributions are significant, addressing key limitations in existing methodologies and demonstrating strong experimental results.
The paper introduces a kinetic-optimal scheduler and a finite-step moment correction for metric-induced discrete flow matching (MI-DFM) in zero-shot text-to-speech (TTS). The methodology is well-structured, addressing two significant limitations of MI-DFM: the reliance on heuristic schedulers and finite-step path-tracking errors. The kinetic-optimal scheduler is derived from Fisher-Rao geometry, providing a training-free numerical approach that avoids hyperparameter tuning. The moment correction effectively mitigates discretization errors in CTMC sampling. Overall, the methods are innovative and present a clear advancement in the field of TTS.
The experiments are robust, utilizing a large-scale dataset for evaluation and controlled comparisons against state-of-the-art systems. The results demonstrate that GibbsTTS achieves superior objective naturalness and speaker similarity metrics, with subjective evaluations further validating the model's performance. The paper effectively communicates the experimental setup and results, providing a comprehensive analysis of the proposed methods.
The paper provides sufficient detail on the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. However, the lack of a public code repository or demo limits the ability of others to fully replicate the results.
The primary limitation is the focus on zero-shot TTS, which may restrict the generalizability of the proposed methods to other domains. Additionally, the reliance on a specific codec and the absence of exploration into alternative distance metrics for token embeddings could limit the applicability of the findings.
The advancements in TTS technology have significant implications for applications in accessibility, entertainment, and human-computer interaction. The proposed methods could enhance the naturalness and speaker similarity in synthesized speech, potentially improving user experiences in various applications.
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cues, and (2) ineffectively modeling multimodal conflicts, leading to semantic cancellation. To address these, we propose a novel Cognitive Dual-Pathway Reasoning (CDPR) framework, which constructs a stable semantic foundation via the intuition pathway and mitigates high-level semantic conflicts through the reasoning pathway, cooperatively establishing deep semantic relations. Specifically, we first employ a representation disentanglement strategy to extract modality-invariant and specific features. Subsequently, the intuition pathway aggregates cross-modal consensus using shared features for solid global representations. The reasoning pathway introduces an inconsistency perception mechanism, combining semantic prototype matching with statistical probability calibration to precisely quantify conflict severity, and dynamically adjusting the weights between both pathways. Furthermore, a multi-view loss function is adopted to alleviate modality laziness and learn structured features at different stages. Extensive experiments on two benchmarks show that CDPR achieves SOTA performance and superior robustness in mitigating multimodal inconsistency. The code is available at https://github.com/Hebust-NLP/CDPR.
Primary: Hebei University of Science and Technology
All Institutions: Hebei University of Science and Technology, Hebei University of Economics and Business
The main contribution of this paper is the introduction of the CDPR framework, which effectively addresses the challenges of multimodal inconsistency in intent recognition through a novel dual-pathway reasoning approach. This work significantly advances the state of the art in MIR by providing a robust methodology and demonstrating its effectiveness through rigorous experimentation.
The proposed Cognitive Dual-Pathway Reasoning (CDPR) framework presents a novel approach to Multimodal Intent Recognition (MIR) by introducing a dual-pathway architecture that simulates human cognitive processes. The methodology effectively disentangles modality-invariant and specific features, employs an inconsistency perception mechanism, and utilizes a multi-view loss function to enhance learning. The integration of intuitive and reasoning pathways is innovative, allowing for adaptive regulation based on conflict levels, thereby addressing significant challenges in existing MIR approaches. The detailed explanation of feature extraction, decoupling, and the dual-pathway mechanism demonstrates a robust theoretical foundation.
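The dynamic weighting between the two pathways can be pictured as a conflict-gated interpolation; the sketch below assumes a scalar conflict score in [0, 1] and a simple convex combination, which is a simplification of the paper's inconsistency perception mechanism.

# Simplified sketch (fusion rule assumed): a conflict score produced by the
# inconsistency perception mechanism shifts weight from the consensus-driven
# intuition representation toward the conflict-aware reasoning representation.
import torch

def fuse_pathways(intuition_repr, reasoning_repr, conflict_score):
    """conflict_score: (B, 1), higher means stronger cross-modal inconsistency."""
    w = conflict_score.clamp(0.0, 1.0)
    return (1.0 - w) * intuition_repr + w * reasoning_repr

fused = fuse_pathways(torch.randn(8, 256), torch.randn(8, 256),
                      torch.sigmoid(torch.randn(8, 1)))
print(fused.shape)    # torch.Size([8, 256])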
The experiments conducted on two benchmark datasets (MIntRec and MIntRec2.0) are comprehensive, showcasing the effectiveness of the CDPR framework. The reported state-of-the-art (SOTA) performance across various metrics (accuracy, F1-score, etc.) substantiates the claims made in the paper. The ablation studies further validate the contributions of individual components, indicating that each part of the proposed method plays a crucial role in achieving superior performance. The robustness tests against noise also highlight the practical applicability of the model in real-world scenarios.
The paper provides sufficient implementation details, including the architecture, datasets, training protocols, and hyperparameters, which facilitate reproducibility. The availability of the code on GitHub enhances the potential for other researchers to replicate and build upon the work.
While the CDPR framework shows promising results, the paper does not extensively discuss the limitations of the proposed method. Potential weaknesses could include the reliance on specific datasets for training and evaluation, which may not generalize well to other multimodal contexts. Additionally, the complexity of the model may pose challenges in terms of computational efficiency in real-time applications.
The advancements in multimodal intent recognition have significant implications for various applications, including human-computer interaction, autonomous systems, and multimedia retrieval. The ability to effectively handle multimodal inconsistencies can enhance user experience in interactive systems, making them more intuitive and responsive to user intentions.
Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. By modeling dependencies across per-stem content and cross-stem harmonic structure, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics, demonstrating the necessity of dedicated multi-instrument timbre transfer over naive separate-then-transfer pipelines. As a result, this work confirms that cross-stem modeling is essential for mixture-level timbre transfer, as the proposed joint setting consistently exceeds an equivalent single-stem ablation.
Primary: unknown
All Institutions: unknown
This paper presents MixtureTT, a novel approach for per-stem timbre transfer directly from polyphonic mixtures, significantly advancing the field of music audio processing. The technical contributions, particularly the joint diffusion model and its evaluation, mark a meaningful step towards more coherent and efficient audio manipulation techniques.
The proposed methodology, MixtureTT, innovatively addresses the challenge of timbre transfer in polyphonic music by employing a joint stem diffusion transformer that operates directly on polyphonic mixtures without requiring explicit source separation. This approach is a significant departure from traditional separate-then-transfer pipelines, which are prone to artifacts and inconsistencies. The architecture effectively balances per-stem independence with cross-stem coordination, allowing for coherent audio generation. The use of a shared diffusion process and the introduction of disentanglement losses further enhance the model's ability to maintain timbral fidelity while preserving content integrity.
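A very rough sketch of the joint denoising step is given below, with all shapes and the conditioning scheme assumed for illustration: per-stem latents are tagged with their timbre reference embeddings, flattened into one sequence so stems can attend to each other, and denoised in a single pass alongside mixture tokens.

# Schematic sketch (not the authors' architecture): noisy per-stem latents are
# stacked, tagged with per-stem timbre embeddings, and jointly denoised by one
# transformer that also attends to an encoding of the input mixture.
import torch
import torch.nn as nn

class JointStemDenoiser(nn.Module):
    def __init__(self, d=256, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.tf = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d, d)

    def forward(self, noisy_stems, mix_enc, timbre_refs):
        # noisy_stems: (B, S, T, d), mix_enc: (B, T, d), timbre_refs: (B, S, d)
        B, S, T, d = noisy_stems.shape
        x = noisy_stems + timbre_refs[:, :, None, :]     # tag each stem with its target timbre
        x = x.reshape(B, S * T, d)                       # one sequence so stems attend to each other
        x = torch.cat([mix_enc, x], dim=1)               # mixture tokens as shared context
        x = self.tf(x)[:, mix_enc.size(1):]              # keep only the stem tokens
        return self.out(x).reshape(B, S, T, d)           # predicted denoised stem latents

net = JointStemDenoiser()
eps = net(torch.randn(1, 4, 50, 256), torch.randn(1, 50, 256), torch.randn(1, 4, 256))
print(eps.shape)   # torch.Size([1, 4, 50, 256])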
The experimental evaluation is robust, utilizing both objective metrics (e.g., Fréchet Audio Distance, Jaccard Distance, Chroma Cosine Similarity) and subjective assessments through a listening test with human participants. The results consistently demonstrate that MixtureTT outperforms single-instrument baselines, even when those baselines are provided with isolated stems. This is a strong validation of the proposed method's effectiveness. The use of the SATB choral dataset is appropriate, though it may limit generalizability to other musical contexts.
While the paper details the training process and architecture, the lack of specific implementation details, such as hyperparameters and the exact training environment, may hinder reproducibility. The authors mention a demo URL, which could provide additional insights into the model's performance, but the absence of a public code repository is a drawback.
One limitation is the reliance on a specific dataset (CocoChorales), which may not fully represent the diversity of musical styles and genres. Additionally, while the model shows promise, the scalability to larger ensembles or more complex musical structures remains untested. The paper also does not address potential computational costs associated with training and inference, which could be a barrier for broader adoption.
The implications of this work are significant for music production and audio engineering, as it enables more flexible and coherent manipulation of musical recordings. This could streamline workflows for musicians and producers, allowing for innovative creative possibilities in music composition and arrangement. Furthermore, the findings encourage further exploration of mixture-level modeling in generative music tasks, potentially influencing future research directions.
Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizations, and hence leads to unacceptable generation latency. In this paper, we propose an end-edge collaborative system design to accelerate multi-condition T2I generation through adaptive condition offloading and pruning. Extensive offline profiling reveals that different conditions exhibit significant diversity in computation and communication costs. To this end, we propose a Subtask Manager that jointly optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, balancing local and edge execution delays to minimize overall preprocessing latency. Then, we design a lightweight feature-driven Conditioning Scale Estimator that evaluates the contribution of each condition by analyzing its feature activation strength and overlap with other conditions. This allows adaptive conditioning scale selection and pruning of insignificant conditions, thereby accelerating the denoising process. Extensive experimental results show that our system reduces latency by nearly 25% and improves average generation quality by 6%, outperforming other benchmarks.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Central South University
The main contribution of this paper is the introduction of an end-edge collaborative system that effectively accelerates multi-condition T2I generation through adaptive condition offloading and pruning. This work represents a meaningful advancement in the field, addressing critical challenges in computational efficiency and user control in AI-generated content.
The paper presents a novel end-edge collaborative system design that addresses the computational and communication overhead associated with multi-condition text-to-image (T2I) generation. The proposed Subtask Manager optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, which is a significant improvement over existing methods. The Conditioning Scale Estimator further enhances the system by evaluating the contribution of each condition, allowing for adaptive pruning of insignificant conditions. This dual approach effectively reduces latency while maintaining image quality, showcasing a well-thought-out methodology that balances local and edge processing.
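The pruning side can be sketched in a similarly hedged way: score each condition by its feature activation strength, discount redundant conditions by their overlap with the others, and drop those whose score falls below a threshold. The overlap measure and the threshold below are illustrative choices, not the paper's exact estimator.

```python
# Illustrative sketch of a feature-driven conditioning-scale estimator.
import torch
import torch.nn.functional as F

def estimate_scales(cond_feats, prune_thresh=0.15):
    """cond_feats: dict name -> feature map of shape (C, H, W) from the control branch."""
    names = list(cond_feats)
    # Activation strength: mean absolute activation, normalized across conditions.
    strength = torch.stack([cond_feats[n].abs().mean() for n in names])
    strength = strength / (strength.sum() + 1e-8)
    # Overlap: mean pairwise cosine similarity of spatially pooled features.
    pooled = torch.stack([cond_feats[n].mean(dim=(1, 2)) for n in names])
    pooled = F.normalize(pooled, dim=-1)
    sim = pooled @ pooled.T
    overlap = (sim.sum(dim=1) - 1.0) / max(len(names) - 1, 1)
    # Redundant, weak conditions receive small scales and may be pruned entirely.
    scores = strength * (1.0 - 0.5 * overlap)
    return {n: (0.0 if s < prune_thresh else float(s)) for n, s in zip(names, scores)}
```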
The experimental results are robust, demonstrating a 25% reduction in latency and a 6% improvement in average generation quality compared to existing benchmarks. The authors conduct extensive profiling and performance evaluations across various hardware setups, which strengthens the validity of their claims. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize the improvements more effectively.
The paper provides a clear description of the experimental setup, including the hardware used and the specific configurations for the algorithms. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should consider making the implementation available to facilitate further research and validation.
One limitation of the proposed system is its reliance on specific hardware configurations, which may not generalize to all user devices. Additionally, the heuristic nature of the optimization may not guarantee the absolute best performance in all scenarios, particularly in highly variable network conditions. The paper also does not address potential scalability issues when the number of users or conditions increases significantly.
The proposed system has significant implications for real-time applications in AI-generated content, particularly in scenarios where user interaction and control are paramount. By reducing latency and improving generation quality, this work could enhance user experiences in creative industries, gaming, and virtual reality. The approach also opens avenues for further research in edge computing and collaborative AI systems.
Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce \textbf{Omni-DeepSearch}, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44\% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.
Primary: Chinese Academy of Sciences (CASIA)
All Institutions: Chinese Academy of Sciences (CASIA), University of Chinese Academy of Sciences (UCAS), Beijing Academy of Artificial Intelligence (BAAI), Peking University, Tsinghua University
The paper presents Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search, highlighting the challenges and limitations of current models while providing a structured methodology for future research in this underexplored area.
The paper introduces a novel benchmark, Omni-DeepSearch, which focuses on audio-driven omni-modal deep search, a largely unexplored area in multimodal learning. The methodology is well-structured, with a clear definition of the task and a multi-stage filtering pipeline that ensures the quality and relevance of the dataset. The authors emphasize audio dependence and multi-hop reasoning, which are critical for evaluating models that must infer and retrieve information across different modalities based solely on audio input. The task taxonomy and dataset construction are thorough, providing a solid foundation for future research.
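The intended interaction pattern, inferring clues from audio and then iteratively invoking search tools until an answer can be committed, can be summarized by the skeletal loop below. The tool names and the model interface are hypothetical placeholders rather than the benchmark's actual harness.

```python
# Skeletal agent loop for audio-driven multi-hop search (illustrative only).
def answer_query(model, tools, audio_clips, question, max_hops=6):
    context = [{"role": "user", "audio": audio_clips, "text": question}]
    for _ in range(max_hops):
        step = model.generate(context)           # returns a dict-like action
        if step["action"] == "final_answer":
            return step["answer"]
        # Expected tool actions: text_search, image_search, video_search.
        tool = tools[step["action"]]
        result = tool(step["query"])
        context.append({"role": "tool", "name": step["action"], "content": result})
    return None  # the model failed to converge within the hop budget
```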
The experiments conducted on various models, including both closed-source and open-source, reveal significant challenges in the task, with the best-performing model achieving only 43.44% accuracy. This highlights the complexity of audio-driven retrieval and reasoning, as well as the limitations of current models. The ablation studies and case analyses provide valuable insights into specific failure modes, such as dominant clue bias and misclassification, which are critical for understanding the limitations of existing approaches.
While the paper provides a comprehensive description of the dataset construction and evaluation metrics, it lacks detailed implementation specifics that would facilitate reproducibility. The absence of a publicly available dataset or code repository further limits the ability of other researchers to replicate the results or build upon this work.
The paper acknowledges several limitations, including the inherent ambiguity of audio signals and the reliance on external knowledge for retrieval. Additionally, the performance gap between closed-source and open-source models suggests that there is still much work to be done in improving model capabilities in this domain. The lack of a publicly available dataset or code also hinders broader adoption and experimentation.
The introduction of Omni-DeepSearch has the potential to significantly impact the field of multimodal learning by providing a new benchmark that emphasizes audio as a primary modality for information retrieval. This could lead to advancements in various applications, including voice-activated assistants, audio-based search engines, and enhanced human-computer interaction systems. By addressing the challenges of audio-driven reasoning, this work opens up new avenues for research and development in multimodal AI.
Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/. The complete source code will be released after the manuscript is accepted.
Primary: Nanjing University
All Institutions: Nanjing University, MiLM Plus, Xiaomi Inc.
The paper presents L3-SE, a novel framework for reducing linguistic hallucination in LM-based speech enhancement through noise-invariant acoustic-semantic distillation, demonstrating significant improvements in linguistic consistency and perceptual quality under challenging conditions.
The proposed L3-SE framework introduces a novel approach to speech enhancement by utilizing a noise-invariant acoustic-semantic distillation strategy. This dual-target distillation method, which leverages both acoustic fidelity and semantic consistency, is innovative in addressing the issue of linguistic hallucination in generative speech models. The architecture effectively combines a shared backbone with task-specific heads, allowing for robust conditioning that enhances the model's performance under noisy conditions. The integration of a high-fidelity codec further supports the quality of the generated speech, making the methodology both comprehensive and well-structured.
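A toy version of the dual-target objective could look like the following: a conditioning encoder on noisy speech produces an acoustic head and a semantic head, each pulled toward a clean-speech teacher. The teacher choices, loss forms, and weights are assumptions made for illustration, not the paper's training recipe.

```python
# Toy dual-target distillation objective for a noise-invariant conditioning encoder.
import torch
import torch.nn.functional as F

def distillation_loss(cond_encoder, acoustic_teacher, semantic_teacher,
                      noisy_wav, clean_wav, w_ac=1.0, w_sem=1.0):
    acoustic_head, semantic_head = cond_encoder(noisy_wav)    # student features
    with torch.no_grad():
        ac_target = acoustic_teacher(clean_wav)               # e.g., codec/SSL acoustic features
        sem_target = semantic_teacher(clean_wav)              # e.g., semantic/linguistic features
    loss_ac = F.l1_loss(acoustic_head, ac_target)
    loss_sem = 1.0 - F.cosine_similarity(semantic_head, sem_target, dim=-1).mean()
    return w_ac * loss_ac + w_sem * loss_sem
```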
The experiments are thorough, utilizing a variety of datasets and evaluation metrics that cover perceptual quality, linguistic consistency, and speaker preservation. The results demonstrate that L3-SE outperforms existing baselines, particularly in challenging conditions such as low-SNR and reverberation. The use of both objective and subjective metrics strengthens the evaluation, providing a well-rounded assessment of the framework's capabilities.
The paper mentions that the complete source code will be released upon acceptance, which is a positive aspect for reproducibility. However, the implementation details are somewhat dense, and while they provide a comprehensive overview of the training process, clearer guidelines or supplementary materials could enhance reproducibility further.
One limitation is the reliance on specific datasets for training and evaluation, which may affect generalizability to other speech enhancement scenarios. Additionally, while the framework shows improvements in linguistic consistency, the perceptual quality metrics could still be further optimized to match or exceed the best-performing models in all conditions.
The proposed framework has significant implications for applications in speech recognition, communication technologies, and assistive devices, where clarity and accuracy in speech are crucial. By addressing linguistic hallucination effectively, it could enhance user experience in various real-world applications, making it a valuable contribution to the field of audio processing and machine learning.
Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies bias sources before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing, and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised, and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important.
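The per-gender threshold adjustment highlighted above can be illustrated with a simple calibration routine that picks a separate operating point per group, here by fixing a target false-positive rate on bona fide speech. The fairness criterion actually used in the paper may differ from this choice.

```python
# Illustrative per-group threshold calibration for a spoof detector.
import numpy as np

def per_group_thresholds(scores, labels, groups, target_fpr=0.05):
    """scores: spoof-detection scores; labels: 1 = spoof, 0 = bona fide."""
    scores, labels, groups = map(np.asarray, (scores, labels, groups))
    thresholds = {}
    for g in np.unique(groups):
        bona = scores[(groups == g) & (labels == 0)]
        # Threshold at the (1 - target_fpr) quantile of bona fide scores, so that
        # roughly target_fpr of genuine utterances in each group are flagged as spoof.
        thresholds[g] = float(np.quantile(bona, 1.0 - target_fpr))
    return thresholds

def decide(score, group, thresholds):
    return score >= thresholds[group]
```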
Primary: Wichita State University
All Institutions: Wichita State University, Institut national de la recherche scientifique (INRS-EMT), INRS-UQO Mixed Research Unit on Cybersecurity
This paper presents a pioneering diagnosis-first framework for addressing gender bias in audio deepfake detection systems, significantly advancing the understanding and mitigation of bias in machine learning applications. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field, highlighting the importance of systematic bias diagnosis before applying mitigation strategies.
The paper introduces a systematic diagnosis-first framework for identifying and mitigating gender bias in audio deepfake detection. This approach is innovative as it emphasizes understanding the sources of bias before applying mitigation strategies, which is a significant departure from existing methods that often apply fixes without thorough diagnosis. The methodology is well-structured, detailing a comprehensive evaluation of bias sources at data, model, and decision levels, and it introduces novel mitigation techniques such as EAFR, SGFS, and GNEA. Each method is clearly defined, and the rationale for their implementation is well-articulated.
The experimental setup is robust, utilizing the ASVSpoof5 dataset, which is appropriate for the study's focus on gender fairness in audio deepfake detection. The paper conducts extensive experiments across multiple models (AASIST and Wav2Vec2+ResNet18) and evaluates various mitigation strategies, providing a thorough analysis of their effectiveness. The results are presented clearly, with a focus on multiple fairness metrics, which enhances the credibility of the findings. However, the reliance on a single dataset may limit the generalizability of the results.
The paper provides sufficient detail regarding the experimental setup, model architectures, and evaluation protocols, which supports reproducibility. However, the absence of publicly available code or a project repository limits the ability for others to reproduce the findings directly. Including a demo or project URL would enhance the reproducibility aspect significantly.
The study is limited to a single dataset (ASVSpoof5) and focuses on binary gender labels, which may not capture the full spectrum of gender representation. Additionally, while the paper identifies multiple sources of bias, it acknowledges that no single method completely closes the fairness gap, indicating that further research is needed to address these issues comprehensively.
The implications of this work are significant, particularly in high-stakes applications such as security and identity verification, where fairness and bias in detection systems can have profound societal impacts. By addressing gender bias in audio deepfake detection, the paper contributes to the broader discourse on fairness in AI systems, emphasizing the need for equitable treatment across demographic groups.