While speech Large Language Models (LLMs) excel at conventional tasks like basic speech recognition, they lack fine-grained, multi-dimensional perception. This deficiency is evident in their struggle to disentangle complex features like micro-acoustic cues, acoustic scenes, and paralinguistic signals. The resulting incomplete comprehension of real-world speech bottlenecks the development of perceptive and empathetic next-generation speech systems. This perceptual limitation stems primarily from three interacting factors: scarce high-quality expressive data, absent fine-grained modeling of multi-dimensional attributes, and reliance on coarse-grained benchmarks with restricted coverage. We address these challenges through three pillars. First, our robust data curation pipeline resolves complex acoustic environments and long-audio timestamp alignment challenges to extract a high-quality spontaneous speech corpus from audiovisual sources. Second, we construct FMSU-Bench, a pioneering benchmark covering 14 speech attribute dimensions to rigorously assess the fine-grained, multi-dimensional speech understanding capabilities of current models. Third, empowered by our curated corpus, we introduce FM-Speech. Driven by a decoupled attribute modeling and progressive curriculum fine-tuning framework, it substantially improves fine-grained, multi-dimensional acoustic perception. Extensive evaluations on FMSU-Bench reveal that current speech LLMs still require significant improvement in multi-dimensional, fine-grained understanding. In contrast, FM-Speech substantially outperforms current open-source models, establishing a robust paradigm for real-world speech understanding.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The paper makes a significant contribution to the field of machine learning by addressing the limitations of existing speech models and proposing a comprehensive framework for fine-grained multi-dimensional speech understanding. The innovative methodologies and rigorous evaluations presented establish a strong foundation for future research in this area.
The paper introduces a comprehensive methodology for fine-grained multi-dimensional speech understanding through a robust data curation pipeline, a pioneering benchmark (FMSU-Bench), and a novel model (FM-Speech). The data pipeline effectively addresses the challenges of extracting high-quality spontaneous speech from audiovisual sources, while the benchmark provides a structured framework for evaluating speech models across 14 distinct dimensions. The progressive curriculum fine-tuning framework for FM-Speech is particularly innovative, allowing for the decoupling of complex auditory attributes and improving model performance on fine-grained tasks.
The experiments conducted on FMSU-Bench demonstrate the effectiveness of the proposed methods. The evaluation includes a systematic comparison of FM-Speech against 11 advanced speech LLMs, showcasing significant improvements in multi-dimensional understanding. The rigorous data filtering and human verification processes enhance the reliability of the benchmark, while the use of innovative evaluation metrics (such as PATA) adds depth to the assessment of model performance.
The paper provides detailed descriptions of the methodology, data curation pipeline, and experimental setup, which contribute to reproducibility. However, the lack of access to the proprietary models used for comparison may limit the ability of others to fully replicate the results. The project URL provides access to the code and resources, which is beneficial for reproducibility.
One limitation is the reliance on specific audiovisual sources (movies and TV shows), which may not fully represent the diversity of real-world speech. Additionally, while the benchmark covers a wide range of speech attributes, the complexity of human speech may still present challenges in capturing all nuances. The paper also does not address potential biases in the data or the models used.
The advancements presented in this paper have significant implications for the development of next-generation speech systems that require fine-grained understanding and empathy in human-computer interactions. The establishment of FMSU-Bench sets a new standard for evaluating speech models, potentially influencing future research and applications in audio processing, speech recognition, and human-computer interaction.
Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test MEG from a large-scale audio library (LibriVox). Second, a speech detection model generates the binary silence/speech sequence directly from this retrieved audio. With this approach, our team Sherlock Holmes achieved first place in the extended track (F1-score: 0.962), demonstrating that leveraging external audio databases is a highly effective strategy.
Primary: Peking University
All Institutions: College of Future Technology, Academy for Advanced Interdisciplinary Studies, Center for BioMed-X Research, Institute of Molecular Medicine, National Biomedical Imaging Center, Peking-Tsinghua Center for Life Sciences, School of Intelligence Science and Technology, Speech and Hearing Research Center, State Key Laboratory of General Artificial Intelligence, State Key Laboratory of Membrane Biology
The paper presents a novel two-step framework for speech detection from MEG signals, achieving state-of-the-art results by leveraging large-scale audio retrieval. This work demonstrates a significant advancement in the field of non-invasive BCIs and opens new avenues for research in audio processing and brain signal interpretation.
The proposed two-step framework is innovative in its approach to bypass direct reconstruction of speech from MEG signals by leveraging a large-scale audio library for retrieval. The use of contrastive learning for matching MEG segments with audio segments is a novel application in this context, highlighting the potential of match-mismatch tasks over traditional regression methods. The methodology is well-structured, with clear steps outlined for both the retrieval and detection phases, although the paper could benefit from more detailed explanations of the model architectures and hyperparameter choices.
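The retrieve-then-detect idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are toy vectors rather than outputs of the authors' contrastive model, `retrieve` and `detect_speech` are hypothetical names, and the energy threshold merely stands in for the speech detection model, whose architecture the review does not specify.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(meg_embedding, audio_library):
    """Step 1: pick the library segment whose embedding best matches the MEG embedding."""
    return max(audio_library, key=lambda k: cosine(meg_embedding, audio_library[k]))

def detect_speech(frame_energies, threshold=0.1):
    """Step 2 (toy stand-in): label each frame of the retrieved audio as speech (1) or silence (0)."""
    return [1 if e > threshold else 0 for e in frame_energies]

# Hypothetical toy library: segment id -> audio embedding, and per-frame energies.
audio_library = {"seg_a": [1.0, 0.0, 0.0], "seg_b": [0.0, 1.0, 0.2]}
segment_energies = {"seg_a": [0.0, 0.5, 0.6], "seg_b": [0.3, 0.0, 0.0, 0.4]}

meg_embedding = [0.1, 0.9, 0.1]  # would come from the MEG encoder
best = retrieve(meg_embedding, audio_library)
labels = detect_speech(segment_energies[best])
print(best, labels)
```

The point of the design is that the silence/speech labels are computed from clean retrieved audio, so their quality depends on retrieval accuracy rather than on reconstructing speech from the MEG signal itself.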
The experiments are robust, with a clear description of data preparation, model training, and testing procedures. The authors achieved an impressive F1-score of 0.962, which is a significant contribution to the field, particularly given the challenges associated with decoding speech from noisy brain signals. However, the paper lacks a comparative analysis with other existing methods, which would strengthen the claims of superiority.
While the paper provides a good overview of the methods and results, it lacks detailed implementation specifics such as code availability, which is crucial for reproducibility. The absence of a public repository or demo limits the ability of other researchers to replicate the results.
One limitation is the reliance on a specific audio library (LibriVox), which may not generalize well to other datasets or real-world applications. Additionally, the method's performance on diverse speech types or accents is not addressed, which could affect its applicability. The paper also does not discuss the computational resources required for the proposed approach, which may limit accessibility for some researchers.
This research has the potential to significantly advance non-invasive brain-computer interfaces (BCIs) and improve communication methods for individuals with speech impairments. The innovative use of audio retrieval could inspire further exploration in related fields, such as cognitive neuroscience and assistive technologies.
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST), University of Seoul
The main contribution of this paper is the introduction of SpeakerLLM, a speaker-specialized audio-LLM framework that effectively integrates speaker understanding and verification reasoning within a natural-language interface. This work significantly advances the field of audio processing by enhancing the explainability and accuracy of speaker verification systems, making it a valuable addition to the literature.
The paper presents a well-structured methodology with a clear two-stage training process for SpeakerLLM, which effectively integrates speaker profiling, recording condition understanding, and verification reasoning. The hierarchical speaker tokenizer is a novel approach that captures different granularities of speaker evidence, enhancing the model's ability to process and understand speaker-specific cues. The decision-composition policy that separates profile-level evidence from the final decision is a significant advancement in explainability for speaker verification systems.
The experiments are comprehensive, demonstrating the effectiveness of SpeakerLLM-Base and SpeakerLLM-VR through various tasks, including speaker profiling and verification reasoning. The results show substantial improvements over general audio-LLMs, especially in tasks requiring fine-grained acoustic evidence. The use of a controlled dataset and clear evaluation metrics strengthens the findings.
The authors commit to releasing the metadata-enriched supervision dataset and target-construction code, which is crucial for reproducibility. However, the paper could benefit from additional details on the implementation of the models and the specific configurations used during training.
The paper acknowledges limitations, including the need for further evaluation of the model in real-world noisy environments and the necessity of consent-aware interfaces for user privacy. The reliance on specific datasets may limit the generalizability of the findings.
The proposed framework has significant implications for the development of audio-first AI systems, particularly in enhancing user interaction through personalized and context-aware speaker verification. The ability to provide explainable decisions in speaker verification can improve trust and usability in applications like conversational agents and security systems.
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, without revealing which musical aspects are dominant in that influence. We propose ARIA, a framework that decomposes attribution along musical aspects (five for symbolic music, three for audio) and pairs the decomposition with reliability diagnostics computed from the segment-level score matrix. It measures within-group similarity among the top-K attributed tracks against random reference groups drawn from the training pool, and diagnoses the score matrix through its singular value decomposition and column statistics. On a symbolic-music model where attribution ground truth is available through counterfactual retraining, the reliability diagnostics rank four attribution methods identically to that ground truth. On an audio music generation model, ARIA reveals attribution behaviors that vary substantially across TDA methods, flags score matrices whose retrieved tracks are nearly identical across queries rather than reflecting per-query attribution, and characterizes embedding-similarity retrieval baselines by the musical aspect each encoder surfaces. Together, ARIA produces per-aspect attribution evidence aligned with the musical aspects considered under the idea-expression distinction in copyright analysis.
Primary: Chalmers University of Technology
All Institutions: Chalmers University of Technology, University of Gothenburg
The paper presents ARIA, a novel framework for music training data attribution that effectively decomposes influence along musical aspects and provides reliability diagnostics, addressing a critical need in the intersection of machine learning and copyright law.
The proposed ARIA framework innovatively decomposes training data attribution (TDA) along multiple musical aspects, addressing a significant gap in existing methods that reduce influence to a single scalar. The methodology includes reliability diagnostics based on segment-level score matrices and singular value decomposition, which are crucial for understanding the attribution behavior of different methods. This multi-faceted approach is particularly relevant in the context of music generation and copyright analysis, as it aligns with the legal framework of idea-expression distinction.
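Two of the diagnostics mentioned above, within-group similarity of the top-K attributed tracks against random reference groups and the check for retrieved sets that barely change across queries, can be sketched as follows. This is a sketch under stated assumptions: the per-aspect feature vectors are toy data, the function names are hypothetical, and cosine similarity and Jaccard overlap are illustrative choices rather than ARIA's documented statistics.

```python
import itertools
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs in a group."""
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def within_group_gap(features, top_k_ids, n_reference_groups=100, seed=0):
    """Similarity of the top-K group minus the mean over random same-size groups."""
    rng = random.Random(seed)
    ids = list(features)
    k = len(top_k_ids)
    top_sim = mean_pairwise_similarity([features[i] for i in top_k_ids])
    ref = [mean_pairwise_similarity([features[i] for i in rng.sample(ids, k)])
           for _ in range(n_reference_groups)]
    return top_sim - sum(ref) / len(ref)

def mean_topk_overlap(top_k_per_query):
    """Mean Jaccard overlap of top-K sets across query pairs (near 1.0 is suspicious)."""
    sets = [set(s) for s in top_k_per_query]
    pairs = list(itertools.combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Toy features for one musical aspect: tracks 0-2 and 3-5 form two clusters.
features = {0: [1.0, 0.0], 1: [1.0, 0.1], 2: [0.9, 0.0],
            3: [0.0, 1.0], 4: [0.1, 1.0], 5: [0.0, 0.9]}
gap = within_group_gap(features, top_k_ids=[0, 1, 2])
overlap = mean_topk_overlap([[0, 1, 2], [0, 1, 2], [0, 1, 3]])
print(round(gap, 3), round(overlap, 3))
```

A positive gap suggests the attributed tracks share the aspect more than chance would explain, while a high cross-query overlap suggests the score matrix is returning the same tracks regardless of the query.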
The experiments conducted on both symbolic and audio music generation models are well-structured, utilizing a benchmark with ground truth for validation and exploring the performance of various attribution methods. The results demonstrate the effectiveness of ARIA in revealing the influence of training songs on generated outputs and highlight the variability of attribution behaviors across different methods. The use of statistical measures to assess within-group similarity adds robustness to the findings.
The paper provides comprehensive details on the experimental setup, including model architectures, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
One limitation is the reliance on existing benchmarks and the challenges associated with creating ground truth for audio attribution, which may affect the generalizability of the findings. Additionally, the framework's performance may vary with different types of music or genres, which is not fully explored in the experiments.
The implications of this research extend to the legal domain, particularly in copyright analysis, as it provides a framework for understanding the influence of training data on generative models. This could aid in developing fair compensation mechanisms for artists and inform future regulations regarding AI-generated content. The framework also sets a foundation for further research in music generation and attribution, potentially influencing how generative models are evaluated and utilized in practice.
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification heads: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
Primary: Zhejiang University
All Institutions: Zhejiang University, Zhejiang Provincial Natural Science Foundation, National Natural Science Foundation of China
The main contribution of this work is the introduction of ToxiAlert-Bench, a comprehensive dataset for paralinguistic-aware toxic speech detection, and a dual-head neural network that significantly improves detection performance by integrating both textual and paralinguistic features. This paper represents a meaningful advancement in the field of audio-based machine learning, addressing a critical gap in existing research and providing a robust framework for future studies.
The paper introduces a novel dual-head neural network architecture designed specifically for detecting toxic speech by leveraging both textual and paralinguistic cues. The methodology is well-structured, involving a multi-stage training strategy that effectively reduces task interference and addresses data imbalance through class-balanced sampling and weighted loss functions. The dataset, ToxiAlert-Bench, is comprehensive, comprising over 30,000 audio clips with detailed annotations that allow for nuanced analysis of toxicity sources. The use of both real and synthesized audio samples enhances the dataset's robustness and diversity.
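The class-balanced sampling and weighted loss mentioned above can be illustrated with inverse-frequency weights. This is a minimal sketch, assuming the common weight n / (C * count); the paper's exact weighting scheme is not given here, and the label names and function names are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count), so rare classes count more."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs is a list of {class: probability} dicts."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Hypothetical imbalanced label set: 8 benign clips, 2 toxic clips.
labels = ["benign"] * 8 + ["threat"] * 2
weights = inverse_frequency_weights(labels)

# Using the same weights as per-sample sampling weights gives each class equal total mass.
class_mass = Counter()
for y in labels:
    class_mass[y] += weights[y]

probs = [{"benign": 0.9, "threat": 0.1}] * 10  # a model that always leans benign
loss = weighted_cross_entropy(probs, labels, weights)
print(weights, dict(class_mass), round(loss, 3))
```

With these weights, the two misclassified rare-class clips contribute as much to the loss as the eight majority-class clips, which is the effect both the sampler and the weighted loss aim for.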
The experiments are thorough, comparing the proposed method against several state-of-the-art baselines. The results demonstrate significant improvements in detection performance, particularly in identifying toxicity conveyed through paralinguistic cues. The paper provides detailed metrics, including accuracy and Macro-F1 scores, which support the claims of the model's effectiveness. The ablation studies further validate the contributions of the model's components, reinforcing the robustness of the findings.
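Because both accuracy and Macro-F1 are reported, a small worked example helps show why the two can diverge on imbalanced data. The labels below are invented for illustration and are not drawn from ToxiAlert-Bench.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, where F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical case: 8 non-toxic clips, 2 toxic clips, classifier predicts non-toxic for all.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(acc, round(mf1, 4))
```

Here the majority-class predictor reaches 80% accuracy but a Macro-F1 of only about 0.44, because the rare class scores zero and Macro-F1 weights every class equally.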
The authors have taken steps to ensure reproducibility by documenting the dataset construction process and providing a GitHub repository for the model. However, the paper could benefit from more detailed implementation specifics, such as hyperparameter settings and training protocols, to facilitate easier replication by other researchers.
One limitation is the reliance on the quality of the synthetic data generated, which may not fully capture the complexity of real-world toxic speech. Additionally, while the dataset is extensive, the focus on English may limit the applicability of the findings to other languages and cultural contexts. The paper does not address potential biases in the dataset or the model's performance across different demographics.
This research has significant implications for online communication platforms, particularly in enhancing moderation systems for audio content. By addressing the nuances of toxic speech that are often overlooked in text-based moderation, the findings could lead to more effective tools for preventing harassment and promoting safer online environments. The dataset and model could serve as foundational resources for future research in audio-based toxicity detection.
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
Primary: Central Conservatory of Music
All Institutions: Central Conservatory of Music, Zhipu AI
The main contribution of this paper is the introduction of BandTok, a novel 2D Mel-spectrogram tokenizer that enhances autoregressive music generation through improved token independence and reconstruction fidelity. This work significantly advances the field by addressing limitations of existing tokenization methods and providing a robust framework for future research in audio generation.
The paper presents BandTok, a novel 2D Mel-spectrogram tokenizer specifically designed for autoregressive music generation. The methodology is well-structured, focusing on improving token independence and reducing error propagation through a shared codebook of Mel-frequency band tokens. The use of a multi-scale PatchGAN discriminator and EMA codebook updates enhances reconstruction fidelity, while the introduction of 2D Rotary Position Embedding (RoPE) effectively preserves the temporal and frequency-band structure during generation. The approach is innovative, leveraging a unique tokenization strategy that contrasts with traditional residual multi-codebook methods.
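The band-token grid described above can be sketched as nearest-neighbor quantization of each frequency band against one shared codebook. This is a simplified illustration under stated assumptions: it quantizes raw Mel values directly, whereas the actual tokenizer presumably quantizes learned encoder features, and the codebook, band size, and function names are hypothetical.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance to vec."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def tokenize(mel, band_size, codebook):
    """Map a (frames x mel_bins) spectrogram to a (frames x bands) grid of token ids.

    Every band of every frame is quantized with the same shared codebook, so a
    token's meaning does not depend on a residual hierarchy of earlier tokens.
    """
    grid = []
    for frame in mel:
        bands = [frame[i:i + band_size] for i in range(0, len(frame), band_size)]
        grid.append([nearest_code(band, codebook) for band in bands])
    return grid

# Toy shared codebook with three 2-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Toy spectrogram: 2 frames, 4 Mel bins each, split into 2 bands of 2 bins.
mel = [[0.1, 0.0, 0.9, 1.1],
       [0.0, 0.8, 0.2, 0.1]]

grid = tokenize(mel, band_size=2, codebook=codebook)
print(grid)
```

Each row of the grid is a time frame and each column a frequency band, which is the 2D layout that a position embedding such as 2D RoPE can then index by (time, band).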
The experiments are comprehensive, comparing BandTok against existing tokenizers and evaluating both reconstruction quality and generation performance. The use of objective metrics like FAD and CLAP scores, alongside subjective assessments, provides a robust evaluation framework. The results indicate that BandTok outperforms residual-codebook tokenizers, demonstrating its effectiveness in a data-limited setting. However, the paper could benefit from more extensive ablation studies to isolate the impact of each component of the proposed method.
The paper provides sufficient implementation details, including training configurations, datasets, and evaluation metrics, which should facilitate reproducibility. The source code and generation demos are publicly available, further supporting the reproducibility of the results. However, the lack of a clear description of the datasets used for training and evaluation could pose challenges for researchers attempting to replicate the study.
One limitation is the reliance on specific datasets, which may affect the generalizability of the results. The paper also does not address potential biases in the training data, which could influence the quality of generated music. Additionally, while the proposed method shows improvements over existing approaches, the paper does not explore the scalability of BandTok with larger datasets or more complex music generation tasks.
The proposed method has significant implications for the field of music generation, particularly in enhancing the quality and fidelity of generated audio. By improving tokenization strategies, BandTok could facilitate advancements in various applications, including music composition, sound design, and interactive audio systems. The integration of multimodal aspects, such as text conditioning, opens avenues for more sophisticated music generation frameworks that could benefit artists and content creators.
Generative models can address difficult problems with non-unique solutions, such as bandwidth extension and gap filling, and can remove highly non-linear artifacts from codecs, clipping, and distortion, as opposed to removing only linear additive components like noise and reverb. While large offline processing models have shown impressive results, these tasks have not been solved with real-time capable models with low latency and compute. We propose a few-step flow matching model using Data Prediction Mean Flows in combination with a suitable novel low-latency architecture to make flow matching models an attractive choice under these constraints. Compared to the state of the art, our proposed mean flow model uses 120x less compute and introduces no algorithmic latency other than the STFT, while achieving similar audio quality.
Primary: Microsoft Research
All Institutions: Microsoft Research
This work presents a significant advancement in real-time speech restoration using generative models, demonstrating a 120x reduction in computational complexity while maintaining audio quality. The combination of innovative methodologies and thorough experimental validation positions this research as a notable contribution to the field of machine learning and audio processing.
The paper introduces a novel few-step flow matching model utilizing Data Prediction Mean Flows (DP-MF) for real-time speech restoration. The methodology is well-structured, addressing the limitations of existing generative models in terms of latency and computational efficiency. The combination of innovative training techniques, such as the introduction of a data prediction loss and the careful design of flow time distributions, demonstrates a significant advancement in the field. The architecture is designed to minimize latency while maximizing audio quality, which is critical for real-time applications.
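To make the few-step idea concrete, the sketch below shows generic mean-flow style sampling, in which a network predicts the average velocity over an interval [r, t] and the sampler jumps from t to r in one update. This is not the paper's DP-MF formulation: the time convention (t = 1 is the starting point, t = 0 the clean signal), the straight-line path, and the conversion from a data prediction to a velocity are all assumptions, and the oracle function stands in for a trained model.

```python
def few_step_sample(z, avg_velocity, steps):
    """Move from t=1 to t=0 in `steps` jumps using z_r = z_t - (t - r) * u(z_t, r, t)."""
    times = [1.0 - i / steps for i in range(steps + 1)]
    for t, r in zip(times, times[1:]):
        u = avg_velocity(z, r, t)
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]
    return z

# Hypothetical clean target; a real model would predict it from the degraded input.
target = [0.5, -0.2]

def oracle_avg_velocity(z, r, t):
    """Velocity implied by a perfect data prediction on a straight path: (z_t - x) / t.

    On a straight path the average velocity over [r, t] equals this value for any r.
    """
    return [(zi - xi) / t for zi, xi in zip(z, target)]

start = [1.0, 1.0]
one_step = few_step_sample(start, oracle_avg_velocity, steps=1)
two_step = few_step_sample(start, oracle_avg_velocity, steps=2)
print(one_step, two_step)
```

With a perfect predictor, one step and two steps land on the same target, which is why average-velocity models can trade many small integration steps for a handful of large ones; with a learned predictor, extra steps mainly correct prediction error.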
The experiments are comprehensive, utilizing a large-scale dataset that simulates real-world audio degradation scenarios. The evaluation metrics include both subjective (MOS) and objective (WER, DNSMOS SIG) measures, which provide a balanced view of the model's performance. The results indicate that the proposed model outperforms existing state-of-the-art models in terms of quality while significantly reducing computational requirements, showcasing the effectiveness of the proposed approach.
The paper provides sufficient details regarding the architecture, training data, and evaluation metrics, which would allow for reproducibility. However, the absence of a public code repository limits accessibility for other researchers wishing to replicate or build upon this work.
While the proposed model shows substantial improvements in latency and computational efficiency, there are still gaps in performance compared to non-causal models, particularly in terms of WER. Additionally, the reliance on specific training data and augmentation techniques may limit generalizability to other types of audio restoration tasks.
The advancements made in this paper have significant implications for various applications, including telecommunications, hearing aids, and augmented reality devices. By enabling real-time speech restoration with reduced computational demands, this work could enhance user experiences in environments where audio quality is critical.
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, NVIDIA
The paper presents a comprehensive study on the impact of synthetic conversational data on multi-talker ASR and speaker diarization, revealing critical insights into simulation strategies and their task-dependent effects. The introduction of FastMSS as an open-source toolkit represents a significant advancement in the field, enabling further research and application in multi-talker speech processing.
The paper introduces FastMSS, an open-source simulator that allows for the generation of synthetic multi-talker conversations with configurable parameters. The methodology is robust, systematically varying key factors such as turn-taking dynamics and source domain diversity. The authors provide a clear rationale for their choices and demonstrate the importance of task-specific simulation strategies, which is a significant contribution to the field. The use of two leading models, DiCoW for MT-ASR and Sortformer for SD, adds depth to the analysis, allowing for a comprehensive understanding of how synthetic data can be optimized for different tasks.
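To make the turn-taking parameter concrete, the following is a minimal sketch of how a simulator could place utterances on a timeline with a configurable amount of overlap. It is not FastMSS's API: `place_turns`, `overlap_ratio`, `overlap_prob`, `max_overlap`, and `max_gap` are hypothetical names, and the sampling scheme is an assumption chosen only for illustration.

```python
import random

def place_turns(durations, overlap_prob, max_overlap=1.0, max_gap=0.5, seed=0):
    """Return (start, end) times in seconds for consecutive utterances.

    Hypothetical scheme: with probability overlap_prob a turn starts before
    the latest end time seen so far (an overlap); otherwise it starts after
    a short random silence.
    """
    rng = random.Random(seed)
    turns, prev_end = [], 0.0
    for dur in durations:
        if turns and rng.random() < overlap_prob:
            start = max(0.0, prev_end - rng.uniform(0.0, max_overlap))
        else:
            gap = rng.uniform(0.0, max_gap) if turns else 0.0
            start = prev_end + gap
        end = start + dur
        turns.append((start, end))
        prev_end = max(prev_end, end)
    return turns

def overlap_ratio(turns):
    """Fraction of speech-active time during which two or more turns are active."""
    # At equal timestamps, ends (-1) sort before starts (+1), so turns that
    # merely touch are not counted as overlapping.
    events = sorted([(s, 1) for s, _ in turns] + [(e, -1) for _, e in turns])
    active = overlapped = 0.0
    count, last_t = 0, None
    for t, delta in events:
        if last_t is not None and count > 0:
            active += t - last_t
            if count > 1:
                overlapped += t - last_t
        count += delta
        last_t = t
    return overlapped / active if active else 0.0
```

Sweeping `overlap_prob` from 0 to 1 in a sketch like this is one way to produce training mixtures with different overlap levels, which is the kind of factor the study reports as helping ASR while hurting diarization.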
The experiments are well-designed, utilizing a variety of datasets that reflect real-world conditions. The results are clearly presented, showing the impact of different simulation strategies on performance metrics such as tcpWER for ASR and DER for diarization. The findings that synthetic data can approach real-data performance and that combining both yields the best results are particularly noteworthy. The paper effectively demonstrates the practical implications of its findings, making it relevant for both academic and industry applications.
The authors emphasize reproducibility by releasing FastMSS as an open-source toolkit, which is a commendable practice in the research community. They provide detailed descriptions of their experimental setup, including datasets and evaluation metrics, which further enhances the reproducibility of their results. However, the reliance on specific configurations and hyperparameters may require careful attention from users to replicate the results exactly.
One limitation noted in the paper is the potential lack of inter-turn semantic coherence in the generated conversations, which could affect the performance of ASR systems. Additionally, while the study covers a range of simulation strategies, the generalizability of the findings to other tasks or domains outside those tested remains uncertain. The paper could also benefit from a more extensive discussion on the ethical implications of using synthetic data in real-world applications.
The research has significant implications for the fields of speech recognition and speaker diarization, particularly in scenarios where real conversational data is scarce. By demonstrating that synthetic data can effectively complement or even substitute real data, this work opens avenues for more efficient training of ASR and diarization systems. The findings could lead to advancements in applications such as virtual assistants, automated meeting transcriptions, and other multi-talker environments.
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.
Primary: Princeton University
All Institutions: Princeton University, Technische Universität München, Rotterdam Eye Hospital, Centre for Tactile Internet with Human-in-the-Loop, Technische Universität Dresden, Munich Center for Machine Learning, Chair for Social Affective Touch
This paper presents a novel real-time sonification framework for enhancing surgical guidance during subretinal injections, demonstrating significant improvements in event identification accuracy through innovative auditory feedback mechanisms. The methodology and experimental results indicate a strong potential for clinical impact, although further validation in diverse surgical contexts is necessary for widespread adoption.
The proposed methodology introduces a structured sonification framework that effectively maps iOCT-derived anatomical features into auditory feedback, leveraging a physics-inspired acoustic model. The approach is well-defined, utilizing real-time updates based on segmented retinal layers and employing a mass-spring-damper system to reflect dynamic interactions during subretinal injections. The integration of both tool-driven and anatomy-driven excitations is innovative, enhancing the auditory feedback's relevance to surgical contexts. However, the reliance on a specific anatomical model may limit generalizability across different surgical scenarios.
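As a rough illustration of the physical model mentioned above, the sketch below simulates a single mass-spring-damper excited by an impulse and returns its displacement as audio samples. It is not the paper's implementation: the function names, parameter values, and the choice of semi-implicit Euler integration are assumptions made for this example.

```python
import math

def stiffness_for_frequency(mass, freq_hz):
    """Spring constant k that gives an undamped resonance at freq_hz."""
    return mass * (2.0 * math.pi * freq_hz) ** 2

def natural_frequency_hz(mass, stiffness):
    """Undamped natural frequency sqrt(k / m) / (2 * pi)."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)

def simulate_msd(mass, stiffness, damping, impulse,
                 sample_rate=44100, n_samples=4410):
    """Simulate m*x'' + c*x' + k*x = 0 after a velocity impulse.

    Starts from x = 0 and v = impulse / mass and integrates with
    semi-implicit Euler. Returns the displacement at each sample.
    """
    dt = 1.0 / sample_rate
    x, v = 0.0, impulse / mass
    out = []
    for _ in range(n_samples):
        accel = (-stiffness * x - damping * v) / mass
        v += accel * dt   # update velocity first
        x += v * dt       # then position, using the new velocity
        out.append(x)
    return out
```

In a sonification setting, an excitation such as needle motion would set `impulse`, while properties of the segmented layer could be mapped to `stiffness` or `damping`; that mapping is hypothetical here and not taken from the paper.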
The user study involving 34 participants provides robust evidence of the proposed method's effectiveness, demonstrating significant improvements in event identification accuracy compared to a baseline. The statistical significance of the results (p < 0.001) strengthens the claims of enhanced performance. The qualitative evaluations and feedback from expert surgeons further validate the clinical applicability of the framework. However, additional details on participant demographics and the specific experimental setup would enhance the evaluation's transparency.
The paper provides a GitHub repository link for the code, which is a positive step towards reproducibility. However, the implementation details could be more thoroughly documented to facilitate easier replication by other researchers. The reliance on specific software libraries (e.g., miPhysics) should also be clearly stated to avoid potential compatibility issues.
The study's limitations include a small sample size for expert feedback and the potential for bias in participant selection. The framework's performance in diverse surgical scenarios beyond subretinal injection remains untested. Additionally, the auditory feedback's effectiveness may vary based on individual surgeon preferences and experiences, which could affect its adoption in clinical practice.
The proposed sonification framework has the potential to significantly enhance surgical precision and reduce cognitive load during delicate procedures like subretinal injections. By providing real-time auditory feedback, it could improve patient outcomes and streamline surgical workflows. The approach may also inspire further research into auditory feedback systems in other medical domains, potentially leading to broader applications in minimally invasive surgeries.
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
Primary: University of Oxford
All Institutions: University of Oxford, Australian Institute for Machine Learning, Stanford University, University of Central Florida, University of Surrey
The paper presents AuralSAM2, a novel framework that enhances the Segment Anything Model 2 by integrating audio features for improved promptable segmentation. This work significantly advances the field of audio-visual integration in machine learning, providing a robust methodology and strong experimental results that demonstrate its potential impact on future research and applications.
The methodology introduces AuralFuser, which effectively integrates audio features into the SAM2 framework without modifying its visual backbone. This is achieved through a novel approach that generates both sparse and dense prompts, enhancing the model's ability to leverage audio cues in segmentation tasks. The introduction of an audio-guided contrastive loss (AudioCon) is particularly innovative as it addresses the challenge of visual dominance in the latent space, ensuring that audio signals are prioritized in the learning process. The hierarchical design of the feature pyramid is a significant methodological advancement that preserves audio influence throughout the network.
The experimental evaluation is robust, utilizing two public benchmarks (Ref-AVS and AVSBench) to demonstrate the efficacy of AuralSAM2. The results show significant improvements in segmentation accuracy compared to existing methods, particularly in human-in-the-loop scenarios, which is a critical application area. The ablation studies effectively highlight the contributions of different components of the proposed method, reinforcing the validity of the results.
The paper provides a link to the code repository, which is essential for reproducibility. However, the implementation details could be more comprehensive, particularly regarding the training setup and hyperparameters used. Clearer documentation would enhance the ability of other researchers to replicate the results.
One limitation is the reliance on the SAM2 framework, which may restrict the generalizability of the proposed method to other architectures. Additionally, while the integration of audio is innovative, the paper does not extensively discuss the potential challenges in real-world applications, such as varying audio quality or background noise.
The integration of audio into visual segmentation tasks has significant implications for various applications, including video analysis, surveillance, and human-computer interaction. By improving the accuracy of segmentation in scenarios where audio cues are present, this work could enhance the usability of AI systems in real-world environments, making them more efficient and effective.
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.
Primary: University of New Brunswick
All Institutions: University of New Brunswick, FPT Software, University of Science
The paper presents a contestable multi-agent framework for multimedia verification that integrates multimodal large language models and an arena-based argumentation approach. The methodology is innovative and addresses critical issues in multimedia verification, although empirical validation and detailed experimental results are needed to fully assess its impact.
The proposed methodology is innovative, integrating multimodal large language models with an arena-based quantitative bipolar argumentation framework. The multi-agent approach effectively decomposes multimedia verification tasks into claim-centered sections, allowing for structured argumentation and transparent reasoning. The use of selective clash resolution and uncertainty-aware escalation enhances the system's robustness and practicality for real-world applications.
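For readers unfamiliar with quantitative bipolar argumentation, the sketch below scores a claim from the strengths of its supporting and attacking arguments. The paper's own A-QBAF resolution rule is not described in this review, so the well-known DF-QuAD gradual semantics is swapped in as a stand-in; the graph encoding and function names are assumptions for illustration.

```python
def aggregate(strengths):
    """Probabilistic sum 1 - prod(1 - s_i); returns 0.0 for an empty list."""
    remaining = 1.0
    for s in strengths:
        remaining *= 1.0 - s
    return 1.0 - remaining

def dfquad_strength(base, attackers, supporters):
    """Combine a base score in [0, 1] with attacker and supporter strengths."""
    va, vs = aggregate(attackers), aggregate(supporters)
    if va >= vs:
        # Attack dominates (or ties): move the score toward 0.
        return base - base * (va - vs)
    # Support dominates: move the score toward 1.
    return base + (1.0 - base) * (vs - va)

def evaluate(graph, node):
    """Score `node` in an acyclic graph.

    graph maps a name to (base_score, attacker_names, supporter_names).
    """
    base, attackers, supporters = graph[node]
    return dfquad_strength(
        base,
        [evaluate(graph, a) for a in attackers],
        [evaluate(graph, s) for s in supporters],
    )
```

For example, a claim with base score 0.5, one supporter of strength 0.8, and one attacker of strength 0.4 ends at 0.5 + 0.5 * (0.8 - 0.4) = 0.7 under this rule. Keeping each argument graph small, as the paper describes, keeps this kind of recursive evaluation cheap.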
The paper lacks detailed experimental results or benchmarks that validate the proposed framework's effectiveness. While it describes the methodology in depth, the absence of empirical data or comparisons against existing methods limits the assessment of its performance and impact.
The implementation is publicly available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient details on the datasets used, evaluation metrics, or specific experimental setups, which could hinder full reproducibility.
The paper does not address potential limitations in terms of scalability, the complexity of the argumentation process, or the handling of ambiguous cases. Additionally, the reliance on external verification tools may introduce variability in results based on the quality of those tools.
The framework has significant implications for multimedia verification, particularly in combating misinformation in digital media. Its emphasis on contestability and transparency could enhance trust in automated verification systems, making it a valuable tool for journalists, fact-checkers, and the general public.
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance by utilizing subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
Primary: Sharif University of Technology
All Institutions: Sharif University of Technology, Independent Researcher
This paper presents the first large-scale dataset of Persian music and successfully adapts a state-of-the-art generative model to this culturally rich domain. The comprehensive methodology and promising results underscore the potential for AI to engage with and celebrate diverse musical traditions.
The methodology is robust, featuring a comprehensive dataset curation process that addresses the significant gap in Persian music resources. The authors employed a sophisticated approach for audio segmentation, tagging, and conditioning using state-of-the-art models. The three-stage training pipeline for adapting MusicGen to Persian music is well-structured, emphasizing unsupervised domain adaptation, instrument-focused fine-tuning, and supervised fine-tuning, which collectively enhance the model's cultural fidelity and stylistic accuracy. However, the reliance on automated tagging and the absence of expert validation for some aspects of the dataset may introduce noise and inaccuracies.
The experimental evaluation is thorough, utilizing both objective metrics (KLD and Chroma Cosine Similarity) and a hybrid evaluation strategy. The results indicate that the fine-tuned model significantly outperforms the baseline in generating culturally coherent Persian music. However, the evaluation could benefit from a more extensive subjective assessment involving trained musicians to capture perceptual qualities that are critical in music generation.
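The two objective metrics named above can be sketched in a few lines. This is a generic illustration, not the paper's evaluation code: the inputs (label distributions for KLD, 12-bin chroma frames for the cosine similarity) and the frame-wise averaging are assumptions about how such metrics are typically set up.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two non-negative, equal-length score lists.

    Both lists are normalised to sum to 1 before the divergence is computed.
    """
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / p_sum, qi / q_sum
        if pi > 0.0:
            total += pi * math.log(pi / max(qi, eps))
    return total

def chroma_cosine_similarity(chroma_a, chroma_b):
    """Mean frame-wise cosine similarity between two chroma sequences.

    Each sequence is a list of frames; each frame is a list of 12 energies.
    Frames are compared pairwise up to the shorter sequence's length.
    """
    sims = []
    for a, b in zip(chroma_a, chroma_b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        sims.append(dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0)
    return sum(sims) / len(sims) if sims else 0.0
```

Lower KLD and higher chroma similarity indicate closer agreement with the reference, but neither captures microtonal detail, which is consistent with the limitation the review raises about these metrics.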
The paper provides a clear description of the dataset creation process and model training, which facilitates reproducibility. However, some details regarding the specific configurations used during training and the exact nature of the evaluation metrics could be elaborated upon to enhance clarity for future researchers attempting to replicate the study.
Key limitations include the dataset's skewed genre distribution towards Persian pop, which may affect the model's generalizability across other Persian music styles. The automatic tagging process may introduce inaccuracies, and the evaluation metrics used do not fully capture the richness of Persian music, particularly in terms of microtonal fidelity and ornamentation. Additionally, the model's performance may be constrained by the smaller variant of MusicGen used for fine-tuning.
This research has significant implications for the field of generative music, particularly in promoting cultural diversity in AI-generated content. By addressing the underrepresentation of Persian music in generative models, this work opens avenues for further exploration of other non-Western musical traditions. The dataset created can serve as a valuable resource for future research in music generation, potentially influencing the development of more culturally-aware AI systems.
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions, or yields inexpressive prompts by leveraging only textual features, ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.
Primary: Kyoto University
All Institutions: Kyoto University, LY Corporation
The main contribution of this paper is the introduction of the TE2SL framework, which enhances text-only domain adaptation in LLM-based ASR by generating expressive pseudo-audio prompts through a learnable refinement module. This work represents a significant advancement in bridging the modality gap in ASR systems, with promising implications for improving performance in data-scarce environments.
The proposed Text-Embedding-to-Speech-Latent (TE2SL) framework innovatively addresses the challenge of text-only domain adaptation in LLM-based ASR by introducing a learnable refinement module that enhances the quality of pseudo-audio prompts. This method effectively bridges the modality gap by ensuring that the synthesized prompts are both sample-dependent and aligned with the characteristics of the audio encoder and projector. The methodology is well-structured, with a clear distinction between training and adaptation phases, and utilizes a Conformer architecture to achieve this refinement. The focus on architecture-aware synthesis is a significant advancement over previous heuristic approaches.
The experiments conducted are thorough, comparing the TE2SL framework against established baselines, including LLM-only fine-tuning and pseudo-audio prompt methods. The results demonstrate substantial improvements in both recognition accuracy and out-of-vocabulary (OOV) recall across multiple datasets in English and Japanese, validating the effectiveness of the proposed method. The use of diverse datasets strengthens the generalizability of the findings, and the metrics employed (WER and CER) are appropriate for evaluating ASR performance.
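For reference, WER and CER are both edit distances normalised by reference length, computed over words and characters respectively. The sketch below is a plain illustration of that definition; it applies no text normalisation, and stripping spaces for CER is an assumption, since scoring conventions differ between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate with spaces removed (an assumption of this sketch)."""
    ref = reference.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hypothesis.replace(" ", "")) / len(ref)
```

CER is the usual choice for Japanese, where word boundaries are not marked by spaces, which is presumably why the paper reports both metrics.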
The paper provides a detailed description of the experimental setup, including model architectures, training configurations, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Clearer documentation or a supplementary material section with implementation details could enhance reproducibility.
One limitation is the reliance on the quality of the audio encoder and projector, which may vary across different languages or domains. Additionally, while the method shows promise in improving OOV recall, the paper does not extensively discuss the implications of these improvements in practical applications. The scalability of the TE2SL framework in low-resource settings, where high-quality audio encoders may not be available, also warrants further exploration.
The proposed approach has significant potential applications in various domains where ASR systems are deployed, particularly in low-resource languages or specialized fields with limited paired data. By improving domain adaptation capabilities, this work can enhance accessibility and usability of ASR technologies in diverse linguistic contexts. The findings could also inform future research on multimodal learning and integration of audio-visual data in ASR systems.
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
Primary: Institute of Engineering, Tribhuvan University
All Institutions: Institute of Engineering, Tribhuvan University
IsoNet presents a novel approach to audio-visual target speech extraction, effectively addressing the limitations of compact microphone arrays in challenging acoustic environments. The combination of advanced methodologies and thorough experimental validation positions this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology in IsoNet is robust, combining multi-channel STFT features, GCC-PHAT spatial cues, and face-conditioned visual embeddings within a U-Net architecture. The use of curriculum learning to progressively introduce SNR challenges is a thoughtful approach that enhances model robustness. The architecture is designed to address specific failure modes of compact microphone arrays, making it relevant for practical applications. The integration of auxiliary direction-of-arrival supervision is a notable addition that helps regularize the learning process.
The experiments are well-structured, utilizing a large dataset of 25,000 simulated mixtures from VoxCeleb, which is appropriate for the task. The evaluation metrics (SI-SDR, PESQ, and STOI) provide a comprehensive view of both objective and perceptual quality. The results demonstrate significant improvements over baseline methods, particularly in challenging SNR conditions. The ablation studies effectively isolate the contributions of different components of the model, providing clear insights into the efficacy of visual and spatial conditioning.
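Since the headline results are reported in SI-SDR and SI-SDR improvement (SI-SDRi), a short sketch of the standard definition may help. The function names are hypothetical, and the code assumes equal-length, zero-mean signals given as plain lists; it is not the paper's evaluation script.

```python
import math

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length, zero-mean signals."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Project the estimate onto the target to get the optimally scaled target.
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]
    noise = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Under this definition, a negative SI-SDRi means the processed output is further from the target than the raw mixture, which is how the abstract's oracle beamformer results should be read.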
The paper provides sufficient detail on the experimental setup, including the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the lack of publicly available code or datasets limits the ability for independent verification of results.
The study primarily focuses on scenarios with a single interfering speaker, which may not fully capture the complexities of real-world environments with multiple speakers and background noise. Additionally, the reliance on simulated data may introduce discrepancies when transitioning to real-world applications. The phase reconstruction method used could also be improved for better performance in low SNR conditions.
The proposed IsoNet system has significant implications for various applications, including voice assistants, hearing aids, and augmented reality devices, where selective listening is crucial. By enhancing the ability to extract target speech in complex acoustic environments, this research could improve user experiences in everyday communication scenarios.
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!," a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The main contribution of this paper is the introduction of "Break-the-Beat!", a novel model for controllable MIDI-to-drum audio synthesis that combines advanced conditioning mechanisms with a pre-trained audio generation framework. This work not only fills a crucial gap in the existing literature but also offers practical tools for music producers, enhancing the creative process in digital music production.
The methodology presented in the paper is robust and innovative, leveraging a pre-trained text-to-audio model (SAO) and introducing a dual-input content encoder that effectively combines MIDI and reference audio for drum synthesis. The hybrid conditioning mechanism is a noteworthy contribution, allowing for precise control over both rhythm and timbre. The use of a novel dataset constructed from existing drum audio datasets is a significant step towards addressing the lack of resources in this area. The authors provide a clear overview of their approach, detailing the input representations, conditioning mechanisms, and training strategies, which enhances the clarity and reproducibility of their work.
The experimental evaluation is thorough, utilizing a well-defined dataset and a variety of metrics to assess the performance of the proposed model. The results demonstrate significant improvements in audio quality, rhythmic alignment, and beat continuity, particularly when using higher temporal resolutions for MIDI input. The paper effectively compares its method against various baselines and provides qualitative and quantitative analyses, which strengthen the validity of the findings. However, the paper could benefit from additional user studies or subjective evaluations to further substantiate the claims of improved audio quality.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository limits the ability of other researchers to fully replicate the study. Providing access to the trained models or code would significantly enhance the reproducibility of the results.
One limitation of the study is the reliance on a specific dataset, which may not encompass the full diversity of drum sounds and styles encountered in real-world music production. Additionally, while the model performs well on the evaluated metrics, the subjective quality of generated audio in practical scenarios remains to be fully explored. The paper also does not address potential computational costs associated with training and inference, which could be a barrier for some users.
The proposed model has the potential to significantly impact digital music production by providing a tool that allows for greater control and creativity in drum synthesis. This could democratize music production for non-experts and enhance the workflow of professional producers. Furthermore, the findings could inspire future research in the area of symbolic-to-audio synthesis, particularly for other instrument types and musical styles.
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
Primary: Radboud University
All Institutions: Radboud University, Radboud University Medical Center
This paper presents the first benchmark for speech-based EarlyPD detection, addressing a critical gap in the literature. The comprehensive methodology and robust experimental evaluation provide a significant contribution to the field, encouraging further research and development in clinically meaningful detection methods.
The paper introduces a well-structured benchmark for Early-stage Parkinson's Disease (EarlyPD) detection from speech, addressing the critical issue of comparability in existing research. The methodology includes a speaker-independent split for datasets, a clear definition of EarlyPD, and a multi-dimensional evaluation framework that allows for nuanced comparisons across various factors, such as gender and disease stage. The use of diverse training-resource settings and the inclusion of both public and private datasets enhance the robustness of the proposed benchmark.
The experiments are comprehensive, utilizing multiple speech tasks and a variety of machine learning models. The results are presented clearly, with a focus on both aggregate and utterance-level performance. The findings indicate significant improvements in EarlyPD detection when expanding speaker diversity, which is a valuable insight for future research. The evaluation metrics used (AUC and F1) are appropriate for the clinical context, ensuring relevance to real-world applications.
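The gap between utterance-level and aggregate results can be illustrated with a small sketch. It assumes that aggregation means averaging each speaker's utterance scores before thresholding; the mean pooling, the 0.5 threshold, and the toy data are illustrative assumptions, not the benchmark's documented protocol.

```python
from collections import defaultdict


def aggregate_by_speaker(speaker_ids, scores):
    """Average utterance-level scores per speaker (assumed pooling rule)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for speaker, score in zip(speaker_ids, scores):
        sums[speaker] += score
        counts[speaker] += 1
    return {speaker: sums[speaker] / counts[speaker] for speaker in sums}


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: two utterances for each of three speakers.
speaker_ids = ["A", "A", "B", "B", "C", "C"]
scores = [0.9, 0.4, 0.2, 0.6, 0.3, 0.4]
speaker_labels = {"A": 1, "B": 0, "C": 1}

# Utterance-level evaluation: every utterance inherits its speaker's label.
utt_true = [speaker_labels[s] for s in speaker_ids]
utt_pred = [int(score >= 0.5) for score in scores]
utterance_f1 = f1_score(utt_true, utt_pred)

# Speaker-level evaluation: pool scores first, then threshold once per speaker.
pooled = aggregate_by_speaker(speaker_ids, scores)
speakers = sorted(pooled)
spk_true = [speaker_labels[s] for s in speakers]
spk_pred = [int(pooled[s] >= 0.5) for s in speakers]
speaker_f1 = f1_score(spk_true, spk_pred)
```

In this toy case the two levels disagree, which is why reporting both gives a fuller picture than either alone.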
The authors emphasize transparency and reproducibility by providing all necessary resources and protocols for replicating their benchmark. The fixed training and evaluation settings, along with the release of datasets, contribute to a high level of reproducibility. However, the reliance on specific datasets may limit generalizability if future datasets differ significantly.
One limitation is the potential bias introduced by the datasets used, particularly in terms of gender representation and the skewed nature of some datasets. Additionally, while the benchmark is robust, the focus on specific speech tasks may not encompass the full range of speech variability seen in real-world clinical settings. The authors also note that spontaneous speech tasks were not included, which could be a significant aspect of EarlyPD detection.
The proposed benchmark has the potential to significantly advance the field of speech-based EarlyPD detection, promoting more reliable and clinically relevant research. By establishing a standardized evaluation protocol, it encourages the adoption of best practices in the community, ultimately leading to improved diagnostic tools for Parkinson's disease. The emphasis on explainability in model design also aligns with current trends in AI, making the findings particularly relevant for future developments in healthcare technology.
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
Primary: ServiceNow
All Institutions: ServiceNow
The main contribution of this paper is the introduction of EVA-Bench, a novel evaluation framework for voice agents that combines realistic simulation with comprehensive metrics to assess performance across various architectures and conditions. This work significantly advances the field by addressing critical evaluation challenges and providing a foundation for future research in voice agent technology.
The methodology presented in this paper is robust and addresses significant gaps in the evaluation of voice agents. The authors introduce an end-to-end framework, EVA-Bench, that combines realistic bot-to-bot audio simulation with comprehensive measurement metrics (EVA-A and EVA-X). The simulation methodology is particularly noteworthy as it incorporates automated validation to ensure the quality of user simulations, which is critical for obtaining reliable evaluation scores. The introduction of controlled perturbations to assess robustness against accent and noise variations further strengthens the methodology, allowing for a nuanced understanding of system performance across different conditions.
The experimental evaluation is thorough, involving 12 systems across three distinct architectures and a total of 213 scenarios. The results reveal critical insights into the performance of voice agents, particularly the divergence between peak and reliable performance, which is a crucial finding for real-world applications. The use of multiple trials and the introduction of pass@1, pass@k, and pass^k metrics provide a comprehensive view of system capabilities, although the paper could benefit from additional comparative analysis against existing benchmarks to contextualize the findings further.
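To make the distinction between peak and reliable capability concrete, the sketch below computes pass@k and pass^k for a single scenario. It assumes the commonly used combinatorial estimators (at least one of k sampled trials succeeds versus all k succeed); the function names and trial counts are hypothetical, and EVA-Bench's exact computation may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical scenario: 5 trials, 3 of which succeeded.
n_trials, n_success = 5, 3
peak = pass_at_k(n_trials, n_success, k=3)       # at least one success in 3 draws
reliable = pass_hat_k(n_trials, n_success, k=3)  # all 3 draws succeed
gap = peak - reliable
```

For k = 1 the two quantities coincide with the plain success rate; they separate as k grows, which is the divergence the evaluation highlights.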
The authors emphasize reproducibility by providing open-source access to the framework, evaluation suite, and benchmark data. They include detailed implementation instructions and configurations, which are essential for other researchers to replicate the study. However, the reliance on commercial model APIs for full reproduction may limit accessibility for some researchers, potentially impacting the overall reproducibility of the findings.
The paper acknowledges several limitations, including potential biases in the LLM-based judges, the lack of multilingual coverage, and the constraints of the user simulator in replicating real human caller behaviors. Additionally, the evaluation does not account for harmful outputs or sensitive information exposure, which is particularly relevant in high-stakes domains. The authors also note that the framework does not assess more complex agent configurations, which may limit its applicability in certain scenarios.
The EVA-Bench framework has significant implications for the development and evaluation of voice agents in enterprise applications. By providing a comprehensive evaluation methodology, it can help improve the reliability and user experience of voice agents, ultimately leading to better deployment in real-world settings. The findings regarding performance gaps and robustness under perturbations can inform future research and development efforts, guiding improvements in voice agent architectures and evaluation practices.
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable fraction of multi-source samples where background interference or overlapping events could limit the usefulness of the data. To address this challenge, we introduce a data curation framework designed for large-scale open audio corpora. Our approach leverages a generative diffusion model to synthesize clean single-class events to construct controlled noisy mixtures for supervision. We subsequently employ a pre-trained audio encoder coupled with a discriminative classifier to automatically identify and filter out multi-source samples. Experiments show that our framework achieves strong performance on a human expert-curated test set. Finally, we release FSD50K-Solo, a model-curated subset of FSD50K containing single-source audio samples identified by our method. Beyond FSD50K, our method establishes a scalable paradigm for curating open-source audio corpora.
Primary: Stony Brook University
All Institutions: Stony Brook University, Bose Corporation
This paper introduces a systematic framework for automated curation of single-source sound events, addressing critical data quality challenges in audio machine learning. The innovative use of generative models for dataset enhancement and the strong experimental results position this work as a significant contribution to the field.
The proposed methodology employs a generative diffusion model to synthesize clean single-source audio events, which is a novel approach to address the challenge of multi-source interference in existing datasets. The framework's reliance on a pre-trained audio encoder and a discriminative classifier for filtering multi-source samples is a significant advancement in automated data curation. The systematic approach to generating controlled noisy mixtures for supervision demonstrates a thoughtful integration of generative modeling with traditional classification techniques.
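One plausible way to build a controlled noisy mixture is to scale an interfering signal so the result hits a chosen signal-to-noise ratio. The sketch below shows that idea in plain Python; SNR-based mixing, the function name `mix_at_snr`, and the toy signals are assumptions for illustration, since the review does not state how the paper constructs its mixtures.

```python
import math


def mix_at_snr(event, noise, snr_db):
    """Add `noise` to `event`, scaled so the event-to-noise power ratio is `snr_db`."""
    if len(event) != len(noise):
        raise ValueError("event and noise must have the same length")
    event_power = sum(x * x for x in event) / len(event)
    noise_power = sum(x * x for x in noise) / len(noise)
    if event_power == 0 or noise_power == 0:
        raise ValueError("both signals must have non-zero power")
    # Desired noise power follows from SNR(dB) = 10 * log10(P_event / P_noise).
    target_noise_power = event_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [e + gain * n for e, n in zip(event, noise)]


# Hypothetical toy signals: a clean "event" and an interfering source.
clean_event = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = mix_at_snr(clean_event, interference, snr_db=0.0)
```

Pairing each mixture with a multi-source label, and each clean event with a single-source label, would then give supervised examples for a classifier of the kind described above.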
The experiments are well-structured, utilizing both generated data and a human-curated internal dataset for evaluation. The performance metrics, including traditional classification metrics and Audiobox Aesthetics scores, provide a robust assessment of the model's effectiveness. The results indicate strong classification performance, particularly on the expert-curated dataset, which underscores the model's practical applicability.
The paper states that the complete clip-level metadata of FSD50K-Solo will be released, supporting reproducibility. However, the lack of a direct link to the dataset or code repository limits immediate access for other researchers. The methodology is described in sufficient detail to allow for replication, but the absence of a public project URL is a drawback.
One limitation acknowledged is the potential domain gap between generated data and real-world audio data, which could affect generalization. Additionally, while the framework shows promise, the exploration of its performance on unseen event classes is still required. The reliance on a human-curated dataset for validation may introduce biases inherent in the curation process.
The release of FSD50K-Solo and the proposed curation framework have the potential to significantly advance audio machine learning research by providing a high-quality dataset that can enhance model training and evaluation. The methodology can be applied to other audio corpora, promoting better practices in dataset curation across the field. The implications of improved audio datasets extend to various applications, including sound event detection, audio synthesis, and machine learning in general.
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
Primary: Ghent University
All Institutions: Ghent University, Vrije Universiteit Brussel, Queen Mary University of London
The main contribution of this paper is the introduction of NAACA, a training-free neuro-inspired architecture that employs oscillatory dynamics for salience-driven attention gating in audio processing. This innovative approach addresses critical limitations in existing audio language models, offering a promising direction for future research and applications in audio understanding.
The methodology presented in NAACA is innovative, leveraging a neuro-inspired Oscillatory Working Memory (OWM) to address the attention bottleneck in Audio Language Models (ALMs). The approach of framing salience detection as an auditory filtering problem is well-grounded in cognitive neuroscience, and the training-free aspect of the architecture is particularly noteworthy. The use of oscillatory dynamics to maintain stable memory states while adapting to salient changes in audio streams is a significant advancement over traditional methods that rely on extensive historical data or training. The detailed formulation of OWM and its integration into the NAACA framework is technically sound, although the complexity of the model may pose challenges for practical implementation.
The experiments conducted on the XD-Violence and Urban Soundscapes of the World (USoW) datasets provide robust evidence of NAACA's effectiveness. The reported improvement in average precision (AP) demonstrates a clear performance gain over existing models, and the qualitative case studies further illustrate the model's ability to detect salient events in complex audio environments. However, the paper could benefit from a more comprehensive comparison with a wider range of baseline models to fully contextualize its performance.
The paper provides a thorough description of the methods and implementation details, which enhances reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the findings. Including a demo or project URL would greatly enhance the paper's impact and usability within the community.
The primary limitation noted is the dependency on the performance of the chosen audio encoder, which may restrict the model's applicability to out-of-distribution sound events. Additionally, the hard-gating mechanism may overlook contextual information that could be preserved with more flexible attention mechanisms. The evaluation metrics focus mainly on anomaly detection, suggesting that future work should explore broader audio understanding tasks.
The implications of this research are significant, particularly in fields such as public safety surveillance, environmental monitoring, and any domain where audio analysis is critical. By improving the efficiency and effectiveness of audio processing in real-time applications, NAACA has the potential to enhance situational awareness and response capabilities in various contexts.
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA regression and a symbolic rendering baseline, while direct regression remains stronger on phase-sensitive waveform L1. Auxiliary RVQ cross-entropy improves short-step diffusion on mel error, onset-flux cosine, and waveform L1, with the most favorable trade-offs occurring at 6-25 denoising steps depending on the metric.
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, Athena RC
This paper contributes a significant advancement in symbolic-to-audio drum rendering through a novel latent-diffusion model that preserves event timing and dynamics while synthesizing realistic audio. The comprehensive methodology and robust experimental evaluation position it as a meaningful contribution to the field of machine learning in audio applications.
The paper presents a novel approach to symbolic-to-audio drum rendering using a conditional latent-diffusion model, which aligns symbolic conditioning to physical time and utilizes PCA for dimensionality reduction in the latent space. The methodology is well-structured, incorporating auxiliary RVQ cross-entropy for improved performance and demonstrating a clear pipeline from symbolic input to audio output. The use of PCA coordinates as a denoising target rather than direct waveform samples is innovative and addresses the challenges of maintaining control over the generated audio while ensuring acoustic fidelity.
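The idea of denoising in standardized principal-component coordinates with a deterministic path back to the latent space can be sketched as follows. This is a minimal NumPy illustration on synthetic low-rank data; the relative singular-value threshold, the function names, and the toy dimensions are assumptions and do not reproduce the paper's DAC configuration.

```python
import numpy as np


def fit_pca(latents, rel_threshold=1e-6):
    """Fit a PCA basis, keeping directions whose singular value exceeds a relative threshold."""
    mean = latents.mean(axis=0)
    centered = latents - mean
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    keep = singular_values > rel_threshold * singular_values[0]
    components = vt[keep]                 # shape: (k, latent_dim)
    coords = centered @ components.T      # unscaled PCA coordinates
    scale = coords.std(axis=0)            # per-component standard deviation
    return mean, components, scale


def encode(latents, mean, components, scale):
    """Project latents to standardized PCA coordinates (a possible denoising target)."""
    return ((latents - mean) @ components.T) / scale


def decode(coords, mean, components, scale):
    """Deterministically map standardized coordinates back to the latent space."""
    return (coords * scale) @ components + mean


# Synthetic stand-in for codec latents: rank-3 structure embedded in 16 dimensions.
rng = np.random.default_rng(0)
toy_latents = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16)) + 0.5

mean, components, scale = fit_pca(toy_latents)
targets = encode(toy_latents, mean, components, scale)
reconstructed = decode(targets, mean, components, scale)
```

Because the retained components span the observed subspace, decoding the targets recovers the original latents up to numerical error, so a model only has to predict the compact coordinates.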
The experimental setup is robust, utilizing a substantial dataset of 11,523 training examples and a variety of evaluation metrics that capture different aspects of audio quality, including spectral fidelity and transient accuracy. The results indicate significant improvements over baseline methods, particularly in spectral and transient metrics, although direct regression outperforms on phase-sensitive waveform metrics. The comprehensive evaluation across multiple configurations and the use of statistical testing to validate findings enhance the credibility of the results.
The paper outlines the training and evaluation processes in detail, including hyperparameters and data preprocessing steps, which supports reproducibility. However, the lack of a public repository at the time of review limits immediate reproducibility. The authors mention plans to release the code, which would further aid in this aspect.
The study is narrow in scope, focusing on short four-beat segments rather than full musical compositions, which may limit the generalizability of the findings. Additionally, the reliance on automatic evaluation metrics without a human listening study raises questions about perceived audio quality. The fixed PCA representation may not be optimal for all contexts, and the evaluation does not account for sampling variability.
The proposed method has significant implications for music technology, particularly in enhancing the controllability and fidelity of drum synthesis in various applications, including music production and interactive audio systems. The approach could inspire further research into symbolic-to-audio translation methods and their integration into broader music generation frameworks.
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
Primary: Adalat AI, India
All Institutions: Adalat AI, India
The paper presents Vividh-ASR, a complexity-tiered benchmark and a novel training strategy (R-MFT) that significantly enhances the performance of ASR systems for low-resource Indic languages. This work is a valuable contribution to the field, addressing critical challenges in adapting multilingual ASR models while preserving their foundational acoustic capabilities.
The paper introduces a systematic factorial design to dissect the effects of learning rate timing and curriculum ordering on ASR performance. The proposed Reverse Multi-Stage Fine-Tuning (R-MFT) method is well-structured, allowing for a clear understanding of how different training strategies impact model adaptation. The complexity-tiered benchmark, Vividh-ASR, is a significant methodological contribution, providing a structured way to evaluate ASR models across varying levels of acoustic complexity.
The experiments are rigorous, employing a controlled factorial design that isolates key variables affecting performance. The results are clearly presented, demonstrating substantial improvements in WER through the proposed methods. The analysis of internal model representations using CKA and SVD adds depth to the evaluation, linking empirical results to theoretical insights about model adaptation.
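For readers unfamiliar with CKA, the sketch below implements the standard linear variant, which scores the similarity of two sets of activations computed on the same inputs. Whether the paper uses linear or kernel CKA is not stated here, so treat the choice, the function name, and the random stand-in activations as assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_examples, n_features)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for layer activations before and after fine-tuning.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rotated = pretrained @ rotation          # same geometry, different basis
unrelated = rng.normal(size=(50, 8))     # independent activations

same_geometry = linear_cka(pretrained, rotated)
different_geometry = linear_cka(pretrained, unrelated)
```

A score near 1 for a layer before and after fine-tuning would indicate that its representational geometry was preserved, which is the kind of evidence the encoder claim rests on.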
The paper provides sufficient details on the implementation, including model architectures, training stages, and hyperparameters, which facilitates reproducibility. However, the lack of a publicly available demo or project URL limits the ease of access to the exact experimental setup.
The study primarily focuses on Hindi and Malayalam, which may limit the generalizability of the findings to other Indic languages or low-resource languages in general. Additionally, while the paper discusses the preservation of the encoder's acoustic geometry, it does not fully explore the implications of this for other model architectures or training paradigms.
The findings have significant implications for improving ASR systems in low-resource languages, potentially enhancing accessibility and usability in diverse linguistic contexts. The introduction of a complexity-tiered benchmark could inspire further research and development in ASR, particularly for languages that have been historically underrepresented in machine learning research.
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).
Primary: unknown
All Institutions: unknown
Text2Score presents a novel two-stage framework for generating sheet music from natural language prompts, significantly advancing the state of symbolic music generation. The methodology effectively separates planning and execution, yielding high-quality outputs that outperform existing models, while the comprehensive evaluation framework sets a new standard for assessing music generation systems.
The methodology presented in Text2Score is innovative, utilizing a two-stage framework that separates the planning and execution phases of music generation. This approach allows for more structured reasoning about musical attributes, which is a significant advancement over traditional end-to-end models. The use of an LLM orchestrator to create a structured measure-wise plan is particularly noteworthy, as it addresses issues related to the lack of aligned text-music datasets. The integration of a generative model that processes this plan through a hierarchical decoder further enhances the robustness of the generation process. The detailed definition of the structural plan and the metrics for evaluation are well-articulated, providing a clear framework for assessing the generated outputs.
The experimental evaluation is thorough, employing both objective metrics and subjective assessments from expert musicians. The paper provides a comprehensive suite of evaluation metrics that cover playability, readability, and prompt adherence, which are crucial for assessing the quality of generated sheet music. The results demonstrate that Text2Score outperforms several baseline models, indicating the effectiveness of the proposed framework. However, the paper could benefit from a more detailed discussion of the dataset's diversity and the specific prompts used in evaluations to better contextualize the results.
The paper includes sufficient details regarding the implementation, including the architecture of the models and the training procedures. The use of ModernBERT and a hierarchical decoder is clearly described, and the authors have made their dataset and code available, which supports reproducibility. However, the lack of specific details about the dataset curation process and the exact nature of the prompts used in evaluations could hinder full reproducibility.
One limitation noted is the potential for the LLM-generated inference plan to diverge from training plans, which could lead to discrepancies in output quality. Additionally, while the evaluation metrics are comprehensive, they may not capture all aspects of musical quality, particularly in terms of expressive nuances that could be important for professional compositions. The paper also acknowledges the need for richer annotations to capture finer musical details, which could enhance the model's performance.
The implications of this work are significant for the fields of music generation and artificial intelligence. By providing a framework that can generate high-quality sheet music from textual prompts, Text2Score opens new avenues for composers and musicians, potentially streamlining the creative process. The open-sourcing of the dataset and code encourages further research and development in this area, promoting collaboration and innovation. The integration of LLMs in music generation also highlights the potential for AI to assist in creative fields, which could lead to broader applications in music education and composition.
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high-confidence pseudo-labels outperforms indiscriminate large-scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
Primary: Durham University
All Institutions: Durham University, IEEE Publication Technology Group
The paper presents a novel semi-supervised framework for speech confidence detection that integrates deep learning with interpretable acoustic features, significantly advancing the field of affective computing. The methodology is innovative, addressing key challenges in data scarcity and subjective annotation, while the experimental results demonstrate strong performance and robustness.
The paper introduces a semi-supervised hybrid framework for speech confidence detection that effectively combines deep semantic embeddings from the Whisper model with interpretable acoustic features. The methodology is robust, employing an Uncertainty-Aware Pseudo-Labelling strategy that prioritizes high-quality pseudo-labels over indiscriminate augmentation, which is a significant advancement in dealing with limited labelled data. The integration of eGeMAPS descriptors and auxiliary features for vocal stress and disfluency detection enhances the model's ability to capture nuanced acoustic signals associated with speaker confidence. The late fusion strategy used to combine different modalities is well-justified and effectively addresses the limitations of relying solely on deep semantic representations.
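The pseudo-label filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two thresholds (`min_confidence`, `max_entropy`), and the toy probabilities are all assumptions chosen for illustration. It only shows the general pattern of keeping a model-generated label when the prediction is both confident and low in entropy.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_pseudo_labels(predictions, min_confidence=0.9, max_entropy=0.35):
    """Keep unlabelled samples whose prediction is confident and low-entropy.

    predictions: list of (sample_id, class_probabilities) pairs.
    Both thresholds are hypothetical values, not taken from the paper.
    Returns a list of (sample_id, pseudo_label) pairs.
    """
    selected = []
    for sample_id, probs in predictions:
        confidence = max(probs)
        if confidence >= min_confidence and entropy(probs) <= max_entropy:
            # The pseudo-label is the index of the most probable class.
            selected.append((sample_id, probs.index(confidence)))
    return selected


# Toy model outputs for three unlabelled clips (made-up numbers).
unlabelled = [
    ("clip_a", [0.97, 0.03]),
    ("clip_b", [0.55, 0.45]),  # ambiguous prediction, should be discarded
    ("clip_c", [0.08, 0.92]),
]
kept = select_pseudo_labels(unlabelled)
```

Only the retained pairs in `kept` would be added to the training set in such a scheme; the ambiguous clip is dropped rather than risk a noisy label.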
The experimental evaluation is thorough, employing a well-structured 5-fold cross-validation approach to ensure the reliability of results. The paper reports a Macro-F1 score of 0.751, which is a notable improvement over various self-supervised baselines, demonstrating the effectiveness of the proposed method. The ablation studies provide insights into the importance of data quality and the contributions of different components of the model, reinforcing the claims made about the hybrid architecture's advantages. However, the paper could benefit from additional qualitative evaluations or user studies to further validate the model's performance in real-world scenarios.
The paper provides a detailed description of the methodology, including dataset creation, feature extraction, and model architecture, which aids in reproducibility. However, the absence of publicly available datasets and code repositories limits the ability of other researchers to replicate the study fully. The authors should consider releasing their code and datasets to enhance transparency and facilitate further research.
The primary limitation of the study is the reliance on a relatively small ground truth dataset (N=600), which may affect the generalizability of the findings. Additionally, the model's performance may be constrained by the subjective nature of confidence detection and the cultural variability in expressing confidence. The focus on short audio clips also limits the model's ability to capture context and fluctuations in confidence over longer interactions.
The proposed framework has significant implications for applications in affective computing, particularly in enhancing human-computer interaction and adaptive learning environments. By enabling machines to detect speaker confidence, the framework could improve the responsiveness of virtual assistants and educational tools, fostering more engaging and supportive user experiences. Furthermore, the insights gained from this research could contribute to mental health monitoring and interventions by identifying low confidence as a precursor to anxiety.
Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Tsinghua University
AuDirector presents a significant advancement in immersive audio storytelling through a self-reflective closed-loop multi-agent framework. Its innovative approach to integrating character profiling, emotional instruction, and user interaction sets a new standard in the field of audio generation, addressing key limitations of existing systems while demonstrating substantial technical impact and potential for broad applications.
The methodology of AuDirector is innovative, integrating a multi-agent framework that combines identity-aware pre-production, collaborative synthesis and correction, and human-guided interactive refinement. Each component is well-defined, with a clear flow from narrative input to audio output. The use of a closed-loop self-correction mechanism is particularly noteworthy, as it addresses common pitfalls in generative audio systems, such as quality inconsistency and lack of user control. The framework's reliance on large language models (LLMs) for character profiling and emotional instruction generation is a strong point, allowing for nuanced audio generation that aligns closely with narrative context.
The experiments conducted are robust, comparing AuDirector against state-of-the-art baselines like WavJourney and PodAgent across both objective and subjective metrics. The use of diverse datasets, including podcasts and radio dramas, enhances the evaluation's credibility. The results indicate significant improvements in structural coherence, emotional expressiveness, and acoustic fidelity, validating the proposed framework's effectiveness. The ablation study further strengthens the findings by demonstrating the importance of the self-correction mechanism.
The paper provides a comprehensive overview of the implementation details, including the specific agents used and the evaluation protocols. However, the lack of a publicly accessible code repository limits reproducibility. Future work could benefit from releasing the code and detailed instructions to facilitate further research and validation by the community.
While AuDirector shows promise, it still faces challenges in generating non-speech audio tracks, particularly regarding acoustic diversity and nuance. This limitation could impact the overall immersion of the audio narratives. Additionally, the complexity of the system may pose challenges for users unfamiliar with audio production, potentially limiting its accessibility.
The potential applications of AuDirector are vast, ranging from enhancing audio storytelling in entertainment to educational tools that require immersive audio experiences. The framework could significantly impact industries such as gaming, film, and interactive media, where high-quality audio narratives are essential. Furthermore, the integration of user feedback into the generation process could lead to more personalized and engaging content.
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.
Primary: StepFun
All Institutions: StepFun, Imperial College London, Peking University, Shanghai Jiao Tong University, The University of New South Wales
The main contribution of this paper is the introduction of OmniClean, a visually debiased evaluation framework for omni-modal language models, and the demonstration of its effectiveness through a comprehensive staged post-training approach. This work significantly enhances the interpretability of model performance and sets a new standard for evaluating multi-modal capabilities in machine learning.
The paper introduces a novel evaluation framework, OmniClean, which effectively filters out visually solvable queries to assess the true omni-modal capabilities of language models. This methodology is significant as it addresses the prevalent issue of visual shortcuts in existing benchmarks, providing a more accurate measure of model performance across audio-visual-language tasks. The staged post-training approach, OmniBoost, combines mixed bi-modal supervised fine-tuning, mixed-modality reinforcement learning, and self-distillation, showcasing a comprehensive strategy for enhancing model performance.
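The filtering step can be sketched as follows. This is an illustrative reading of the abstract, not the released pipeline: `clean_benchmark`, the `min_kept` fallback rule, and the query identifiers are hypothetical. Only the two query counts at the end come from the paper.

```python
def clean_benchmark(queries, solved_by_visual_only, min_kept=1):
    """Drop queries that a visual-only probe already answers correctly.

    queries: list of query identifiers for one benchmark.
    solved_by_visual_only: set of identifiers the visual-only probe got right.
    min_kept: hypothetical threshold; if filtering leaves fewer queries than
        this, the full subset is retained so comparisons stay stable.
    """
    kept = [q for q in queries if q not in solved_by_visual_only]
    return kept if len(kept) >= min_kept else list(queries)


# Toy example: one of three queries is solvable from vision alone.
cleaned = clean_benchmark(["q1", "q2", "q3"], {"q2"})

# Counts reported in the abstract: 8,551 retained out of 16,968 audited.
retained_fraction = 8551 / 16968
```

With the reported counts, roughly half of the audited queries survive the visual-only filter.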
The experiments are well-structured, utilizing a large dataset of 16,968 queries, from which 8,551 were retained after visual-only probing. The results demonstrate that the proposed methods yield meaningful improvements in model performance, particularly in the context of the cleaned evaluation view. The comparison against existing models, including larger counterparts, highlights the effectiveness of the proposed training stages.
The paper provides detailed descriptions of the experimental setup, including data sources, training protocols, and evaluation metrics. However, the lack of access to the actual model weights and code may hinder full reproducibility. The authors do release the OmniClean dataset, which aids in replicating the evaluation process.
One limitation is the reliance on the specific model architecture (Qwen2.5-Omni-3B) for the experiments, which may not generalize to other architectures or larger models. Additionally, while the visually debiased evaluation is a significant improvement, it does not eliminate all forms of bias in the benchmarks. The paper also acknowledges that the self-distillation results are profile-dependent, indicating variability in effectiveness across different datasets.
The findings have substantial implications for the development of omni-modal language models, as they provide a clearer understanding of model capabilities and limitations. By addressing visual leakage in evaluations, the work encourages the design of more robust benchmarks that can better assess true multi-modal integration. This could lead to advancements in applications requiring comprehensive understanding across audio, visual, and textual modalities.
We propose the Chunkwise Aligner, a novel architecture for streaming automatic speech recognition (ASR). While the Transducer is the standard model for streaming ASR, its training is costly due to the need to compute all possible audio-label alignments. The recently introduced Aligner reduces this cost by discarding explicit alignments, but this modification makes it unsuitable for streaming. Our approach overcomes this limitation by dividing the audio into chunks and aligning each label to the leftmost frames of its chunk, whereas transitions between chunks are managed by a learned end-of-chunk probability. Experiments show that the Chunkwise Aligner not only matches the Transducer's accuracy in both offline and streaming scenarios, but also offers superior training and decoding efficiencies.
Primary: University of Electro-Communications
All Institutions: University of Electro-Communications, NTT, Inc.
The main contribution of this paper is the introduction of the Chunkwise Aligner, which enhances streaming ASR by effectively managing audio segmentation and alignment, resulting in improved efficiency and comparable accuracy to existing models. This work represents a meaningful advancement in the field of speech recognition, particularly for applications requiring real-time processing.
The proposed Chunkwise Aligner introduces a novel architecture that effectively addresses the limitations of existing models in streaming automatic speech recognition (ASR). By segmenting audio into chunks and utilizing a learned end-of-chunk probability for transitions, the methodology enhances both training efficiency and decoding speed while maintaining accuracy comparable to the Transducer model. The architecture's reliance on self-transduction within chunks is innovative and provides a clear advantage in streaming scenarios, showcasing a thoughtful adaptation of existing techniques.
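One way to picture the "leftmost frames of its chunk" rule is the toy construction below. It is only a sketch of that single sentence under stated assumptions: the function name, the input format (labels already grouped by chunk), and the error raised when a chunk holds more labels than frames are all hypothetical, and the learned end-of-chunk probability is not modelled at all.

```python
def leftmost_alignment(labels_per_chunk, chunk_size):
    """Assign each label to the leftmost free frame of its chunk.

    labels_per_chunk: list of label lists, one per chunk, in time order
        (how labels are assigned to chunks is assumed to be given).
    chunk_size: number of encoder frames per chunk.
    Returns a list of (frame_index, label) pairs.
    """
    alignment = []
    for chunk_index, labels in enumerate(labels_per_chunk):
        if len(labels) > chunk_size:
            raise ValueError("more labels than frames in chunk %d" % chunk_index)
        start = chunk_index * chunk_size
        for offset, label in enumerate(labels):
            # The j-th label of a chunk lands on the j-th frame of that chunk.
            alignment.append((start + offset, label))
    return alignment


# Three chunks of four frames each; the middle chunk emits no label.
toy = leftmost_alignment([["h", "e"], [], ["y"]], chunk_size=4)
```

In this toy case each chunk has exactly one admissible alignment, which is the intuition behind avoiding the Transducer's sum over all audio-label alignments.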
The experiments conducted on well-known datasets, LibriSpeech and CSJ, provide a robust evaluation of the Chunkwise Aligner's performance. The reported results demonstrate that the model achieves competitive word error rates (WER) and character error rates (CER) while significantly improving decoding speed over the Transducer. The inclusion of various configurations and the analysis of alignment strategies add depth to the evaluation, supporting the claims made regarding the model's efficiency and effectiveness.
The paper outlines the system configuration, including model architecture, training parameters, and data preprocessing steps, which contributes to reproducibility. However, the lack of publicly available code or a project URL limits the ability for independent verification of results. Providing access to the implementation would enhance the reproducibility of the findings.
One notable limitation is the dependency on forced alignments for training, which may affect performance when the timing of ground truth labels does not align with the model's predictions. Additionally, the paper acknowledges potential degradation in performance with varying alignment strategies, indicating that further exploration is needed to improve robustness in diverse scenarios.
The Chunkwise Aligner has significant implications for real-time speech recognition applications, particularly in environments where low latency is critical. Its ability to maintain accuracy while reducing computational costs makes it suitable for deployment in various devices and applications, potentially enhancing user experiences in voice-activated systems and automated transcription services.
Singing Voice Conversion (SVC) aims to transform a source singing voice into that of a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT, and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity, and harmony reconstruction across both harmony-rich and single-melody recordings.
Primary: Beijing University of Civil Engineering and Architecture
All Institutions: Beijing University of Civil Engineering and Architecture, Lyra Lab, Tencent Music Entertainment, Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture
The main contribution of this paper is the introduction of Poly-SVC, a novel singing voice conversion system that effectively handles residual harmonies in polyphonic scenarios, significantly improving the quality of voice conversion outputs. The technical contributions, particularly in pitch extraction and model architecture, represent a meaningful advancement in the field of audio processing and machine learning.
The proposed Poly-SVC framework introduces a novel approach to singing voice conversion that addresses the challenges posed by residual harmonies in accompanied recordings. The use of a Constant-Q Transform (CQT) for pitch extraction is innovative, as it allows for the preservation of both lead melodies and residual harmonies, which is crucial for high-fidelity audio synthesis. The integration of a random sampler to mitigate interference and a Conditional Flow Matching (CFM)-based diffusion decoder further enhances the model's robustness. The methodology is well-structured, with clear delineation of components and their functions, although it could benefit from more detailed explanations of the CFM loss and its implications.
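For readers unfamiliar with the CFM loss mentioned above, the commonly used optimal-transport form of the conditional flow matching objective is given below. Whether Poly-SVC uses exactly this form is not stated in the review, so the notation is an assumption: $x_1$ is a target acoustic frame, $x_0$ is Gaussian noise, $c$ stands for the conditioning features (pitch, content, timbre), and $\sigma_{\min}$ is a small constant.

```latex
x_t = \bigl(1 - (1 - \sigma_{\min})\, t\bigr)\, x_0 + t\, x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\| v_\theta(x_t, t, c) - \bigl(x_1 - (1 - \sigma_{\min})\, x_0\bigr) \bigr\|^2
```

The network $v_\theta$ is trained to regress the constant velocity that carries the noise sample $x_0$ to the data sample $x_1$ along a straight path; at inference an ODE solver integrates $v_\theta$ from $t = 0$ to $t = 1$.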
The experiments are comprehensive, utilizing a diverse set of datasets that include both speech and singing data across multiple languages. The subjective evaluation framework, employing Mean Opinion Score (MOS) and Similarity-MOS (SIM-MOS), is appropriate for assessing the quality of voice conversion in both single-melody and harmony-rich scenarios. The results demonstrate that Poly-SVC significantly outperforms existing baselines, particularly in preserving harmonic structures, which is a key contribution to the field. However, the paper could enhance its experimental rigor by including objective metrics alongside subjective evaluations.
The paper provides a reasonable level of detail regarding the implementation, including the choice of models and datasets. However, it lacks specific URLs or repositories for code and data, which are essential for reproducibility. Clearer descriptions of hyperparameters and training procedures would also aid in replicating the results.
One limitation is the reliance on subjective evaluations, which can be influenced by individual preferences and biases. Additionally, the model's performance in extremely complex polyphonic scenarios remains to be fully explored. The paper also acknowledges the challenge of content overlapping, which is not fully addressed in the current framework.
The implications of this research extend to various applications in music production, entertainment, and accessibility. By improving the quality of singing voice conversion, Poly-SVC could facilitate personalized music experiences, enhance karaoke applications, and support language learning through singing. The approach may also inspire further research into polyphonic audio processing and machine learning applications in music.
Over the past two decades, the task of musical beat tracking has transitioned from heuristic onset detection algorithms to highly capable deep neural networks (DNNs). Although DNN-based beat tracking models achieve near-perfect performance on mainstream, percussive datasets, the SMC dataset has stubbornly yielded low F-measure scores. By testing how well state-of-the-art models detect beats on individual tracks in the SMC dataset, we identify three distinct failure modes: octave errors, continuity errors, and complete tracking failure where all metrics fall below 0.3. We reveal that state-of-the-art models tend to generate "confident-but-wrong" activations. Furthermore, we show that the standard DBN's default minimum tempo of 55 BPM prevents it from inferring the correct tempo for 21% of SMC tracks, forcing double-tempo predictions on slow music. By exposing such fundamental oversights, we provide concrete directions for improving beat and downbeat detection, specifically emphasizing training data diversification and multi-hypothesis tempo estimation.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
This paper provides a critical analysis of beat tracking failures in state-of-the-art models, identifying specific weaknesses and proposing directions for future research. The combination of detailed diagnostics and practical recommendations positions this work as a valuable contribution to the field of music information retrieval.
The paper presents a thorough diagnostic analysis of beat tracking models on the SMC dataset, identifying specific failure modes that have not been previously documented. The methodology involves a systematic evaluation of three state-of-the-art models, categorizing the difficulties of the dataset into four axes and analyzing the activation functions to pinpoint the causes of errors. This approach is innovative as it combines qualitative analysis with quantitative metrics, providing a nuanced understanding of model performance.
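The double-tempo effect noted in the abstract follows from simple arithmetic, sketched below. This is not a DBN: it only shows which tempo remains reachable when the search space has a hard lower bound. The 55 BPM minimum comes from the abstract; the function name and the 215 BPM upper bound are assumptions for illustration.

```python
def nearest_reachable_tempo(true_bpm, min_bpm=55.0, max_bpm=215.0):
    """Fold a tempo by octaves until it falls inside [min_bpm, max_bpm].

    min_bpm is the default cited in the abstract; max_bpm is an assumed
    upper bound for this sketch.
    """
    bpm = true_bpm
    while bpm < min_bpm:
        bpm *= 2.0  # too slow for the tracker: the double tempo is chosen
    while bpm > max_bpm:
        bpm /= 2.0  # too fast for the tracker: the half tempo is chosen
    return bpm


slow_track = nearest_reachable_tempo(40.0)    # below the 55 BPM floor
normal_track = nearest_reachable_tempo(60.0)  # already inside the range
```

A 40 BPM piece can only be reported at 80 BPM under such a constraint, which is an octave error by construction regardless of how good the activations are.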
The experiments are well-structured, employing an 8-fold cross-validation setup to ensure robust evaluation. The use of various metrics, including F-measure and continuity metrics, allows for a comprehensive assessment of model performance. The results reveal significant insights into the limitations of current models, particularly in handling tempo instability and metrical ambiguity, which are critical for advancing the field.
While the paper outlines the experimental setup and evaluation metrics, it lacks detailed information on the implementation of the models and the specific configurations used. This could hinder reproducibility for other researchers attempting to replicate the results or build upon the findings.
One limitation of the study is the reliance on the SMC dataset, which, while challenging, may not fully represent the diversity of musical styles encountered in real-world applications. Additionally, the findings suggest a need for more diverse training data to overcome the identified activation ceiling, indicating that current models may not generalize well to other datasets.
The insights gained from this research have the potential to significantly influence future work in music information retrieval, particularly in improving beat tracking algorithms. By addressing the identified failure modes and suggesting enhancements to model training and architecture, this work could lead to more robust systems that better handle complex musical structures.
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
Primary: Independent Researcher
All Institutions: Independent Researcher
STRUM presents a comprehensive audio-to-chart pipeline for rhythm games, addressing a critical bottleneck in chart creation. The paper's technical contributions, including a detailed methodology and thorough evaluation, position it as a meaningful advancement in the field of automatic music transcription.
The methodology employed in STRUM is robust and multifaceted, utilizing a combination of CRNNs, ensemble classifiers, and various detection techniques tailored to different instruments. The use of a two-stage CRNN for drums, alongside a comprehensive pipeline for other instruments, reflects a thoughtful approach to the complexities of audio transcription. The ablation studies provide valuable insights into the contributions of individual components, enhancing the credibility of the results. However, the reliance on pre-processed audio stems and the lack of a fully end-to-end model may limit the generalizability of the findings.
The experimental evaluation is thorough, featuring a well-defined benchmark of 30 songs and clear metrics for performance assessment. The F1 scores reported for different instruments indicate a solid performance, particularly for drums, which is the focus of the paper. The use of a specific audio-quality criterion for song selection is a commendable approach that adds rigor to the evaluation process. However, the limited sample size and the specific genre focus may restrict the applicability of the results across a broader range of music.
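To make the reported metric concrete, the sketch below computes an onset F1 at a ±100 ms tolerance and searches a grid of global offsets for the best score. It is a simplified stand-in, not STRUM's evaluation code: the greedy in-order matching, the function names, the offset grid, and the toy onset times are all assumptions.

```python
def onset_f1(reference, estimated, tolerance=0.100):
    """F1 between two onset lists (seconds) with one-to-one greedy matching."""
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        delta = est[j] - ref[i]
        if abs(delta) <= tolerance:
            hits += 1  # matched pair; both onsets are consumed
            i += 1
            j += 1
        elif delta < 0:
            j += 1  # estimated onset is too early to match anything left
        else:
            i += 1  # reference onset was missed
    if hits == 0:
        return 0.0
    precision, recall = hits / len(est), hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_global_offset(reference, estimated, offsets, tolerance=0.100):
    """Return (offset, f1) for the candidate shift that maximises F1."""
    def score(offset):
        return onset_f1(reference, [t + offset for t in estimated], tolerance)
    best = max(offsets, key=score)
    return best, score(best)


# Toy data: estimates lag the chart by 250 ms and include one extra onset.
chart = [1.0, 2.0, 3.0]
detected = [1.25, 2.25, 3.25, 4.0]
offset, f1 = best_global_offset(chart, detected, offsets=[-0.25, 0.0, 0.25])
```

Without the offset search the toy detections score zero at this tolerance, which illustrates why a per-song global offset matters when chart timing and audio timing differ by a constant shift.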
The authors have made significant efforts to ensure reproducibility by releasing code, model weights, and a benchmark manifest. This transparency is crucial for the research community, allowing others to validate and build upon the work. The detailed descriptions of the methodologies and the evaluation protocols further enhance the reproducibility of the study.
The paper acknowledges several limitations, including the rejection of songs based on audio quality, which may exclude potentially valuable data. The vocal transcription performance is notably lower than that of the instruments, indicating a misalignment between detected onsets and community charting practices. Additionally, the blue lane's low accuracy highlights challenges in accurately distinguishing between similar sounds in the context of rhythm games.
STRUM has the potential to significantly impact the rhythm game community by streamlining the chart creation process, making it more accessible for newcomers while providing experienced charters with a solid starting point. The open-source nature of the project encourages collaboration and further development, which could lead to advancements in automatic music transcription and related fields.