Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
Primary: Tianjin University
All Institutions: Tianjin University, Kuaishou Technology, Institute of Automation, Chinese Academy of Sciences
UniSonate presents a unified framework for audio generation that synthesizes speech, music, and sound effects through a novel natural language interface. The technical contributions, including dynamic token injection and a multi-stage curriculum learning strategy, significantly advance the field of generative audio modeling, offering a comprehensive solution to the challenges of multimodal audio synthesis.
The methodology proposed in UniSonate is innovative, introducing a unified flow-matching framework that integrates speech, music, and sound effect generation through a natural language interface. The dynamic token injection mechanism is particularly noteworthy as it allows unstructured sound effects to be processed in a structured manner, enabling precise control over audio generation. This is complemented by a multi-stage curriculum learning strategy that effectively mitigates optimization conflicts, showcasing a thoughtful approach to training across diverse audio modalities.
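For readers unfamiliar with flow matching, the snippet below sketches the standard conditional flow-matching objective that such frameworks typically optimize, under a linear interpolation path; it is a generic illustration rather than UniSonate's training code, and `model` and `cond` are placeholder names.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """Generic conditional flow-matching objective under a linear path.

    x1:   clean audio latents, shape (B, T, D)
    cond: conditioning embeddings (e.g., instruction / phoneme tokens)
    The model predicts the velocity field v(x_t, t, cond); with
    x_t = (1 - t) * x0 + t * x1 and x0 ~ N(0, I), the target velocity
    is simply x1 - x0.
    """
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                          # point on the probability path
    pred_velocity = model(xt, t.view(-1), cond)           # placeholder signature
    return F.mse_loss(pred_velocity, x1 - x0)
```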
The experimental evaluation is robust, with extensive comparisons against state-of-the-art models in TTS, TTM, and TTA. The paper presents clear metrics for performance evaluation, including WER, SongEval scores, and subjective evaluations like MOS. The results indicate that UniSonate achieves state-of-the-art performance in TTS and TTM while maintaining competitive fidelity in TTA, demonstrating the effectiveness of the proposed methods.
The paper provides a comprehensive description of the model architecture, training procedures, and datasets used, which supports reproducibility. However, the lack of a public code repository may hinder independent verification of results. The authors do mention the use of specific hardware configurations and hyperparameters, which aids in understanding the implementation details.
The paper acknowledges limitations, particularly in the sound effect generation where performance lags behind specialized models. Additionally, challenges in generating long-form audio content and the inherent ambiguity in natural language instructions are highlighted. These limitations suggest areas for future research and improvement.
The potential applications of UniSonate are significant, as it paves the way for general-purpose audio generation systems that can synthesize complex auditory scenes. However, ethical considerations regarding the misuse of generated audio, biases in training data, and copyright issues in music generation are critical and warrant careful attention.
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these limitations to the lack of data, benchmarks, and modeling approaches tailored for long-form temporal awareness. To bridge this gap, we first construct LAT-Chronicle, a 1.2k-hour long-form audio dataset with temporal annotations across real-world scenarios. We further develop LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes while covering three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption. Leveraging these resources, we propose LAT-Audio, formulating temporal awareness as a progressive global-to-local reasoning paradigm. A global timeline is first constructed as an aligned temporal-semantic context, and the Think-With-Audio Chain-of-Thought (TWA-CoT) is then introduced to perform iterative reasoning by incorporating local audio information via tool use. Experiments show that LAT-Audio surpasses existing models on long-form audio temporal awareness tasks and improves robustness to input duration. We release the dataset, benchmark, and model to facilitate future research at https://github.com/alanshaoTT/LAT-Audio-Repo.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Independent Researcher
The main contribution of this paper is the introduction of a novel framework and dataset for improving temporal awareness in long-form audio understanding, which significantly advances the state of the art in audio language models. The comprehensive methodology, robust experimental validation, and potential applications underscore its significance in the field of machine learning and audio processing.
The paper presents a comprehensive methodology that addresses the limitations of existing Large Audio Language Models (LALMs) in handling long-form audio. The authors construct a new dataset (LAT-Chronicle) and benchmark (LAT-Bench) specifically designed for Long-form Audio Temporal Awareness (LATA) tasks, which include Dense Audio Captioning, Temporal Audio Grounding, and Targeted Audio Captioning. The proposed LAT-Audio framework introduces a novel global-to-local reasoning paradigm and the Think-With-Audio Chain-of-Thought (TWA-CoT) approach, which iteratively refines audio understanding by leveraging local audio segments based on a constructed global timeline. This innovative approach is well-justified and effectively addresses the challenges posed by long-form audio inputs.
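To make the global-to-local paradigm more concrete, the sketch below shows one plausible shape of such a reasoning loop: a coarse timeline is built first, and the model then requests short local segments through tool calls before committing to an answer. All callables are hypothetical stand-ins, not the released LAT-Audio API.

```python
def global_to_local_answer(generate, build_timeline, listen, parse_span,
                           audio, question, max_steps=4):
    """Sketch of a global-to-local reasoning loop in the spirit of TWA-CoT.

    Hypothetical callables:
      generate(prompt) -> str            next reasoning step or final answer
      build_timeline(audio) -> list      coarse (start, end, event) tuples
      listen(audio, start, end) -> str   caption for a local segment (tool use)
      parse_span(step) -> (start, end)   extract the requested time span
    """
    timeline = build_timeline(audio)
    context = f"Timeline: {timeline}\nQuestion: {question}"
    for _ in range(max_steps):
        step = generate(context)
        if step.startswith("LISTEN"):          # model asks to hear a local segment
            start, end = parse_span(step)
            context += f"\nSegment {start:.1f}-{end:.1f}s: {listen(audio, start, end)}"
        else:
            return step                        # model committed to an answer
    return generate(context + "\nGive your final answer now.")
```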
The experimental evaluation is robust, demonstrating the effectiveness of LAT-Audio against existing models across multiple tasks. The authors provide thorough comparisons with baseline models and conduct ablation studies to validate the importance of key components such as the global timeline and TWA-CoT. The results show significant improvements in performance metrics, indicating that the proposed methods enhance temporal awareness and robustness in long-form audio understanding. The inclusion of a diverse dataset and human-verified benchmarks adds credibility to the findings.
The paper includes detailed implementation details and a clear description of the training strategy, which enhances the reproducibility of the results. The authors provide access to the dataset, benchmark, and model through a GitHub repository, facilitating further research and validation of their findings by the community.
While the proposed framework shows promise, there are limitations, such as the computational overhead introduced by multi-turn reasoning and tool use, which may hinder real-time applications. Additionally, the focus on single-audio inputs limits the framework's applicability in more complex multimodal scenarios. Future work is needed to enhance efficiency and extend the framework to broader contexts.
The research has significant implications for various applications, including automated transcription, audio search engines, and multimedia content analysis. By improving long-form audio understanding, the work can enhance user experiences in domains such as education, entertainment, and accessibility for the hearing impaired. The open-source nature of the project encourages further innovation and exploration in the field of audio language processing.
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward intended targets. To address these bottlenecks, we propose a prompt-free framework decoupling acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model enforcing monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the IF strategy under the knowledge transfer principle. Experiments show CROTTC-IF achieves a 71.77% F1-score on L2-ARCTIC and 71.70% F1-score on the Iqra'Eval2 leaderboard. With empirical analysis, we demonstrate that decoupling acoustics from explicit priors provides highly robust MDD.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
The main contribution of this paper is the introduction of a prompt-free paradigm for mispronunciation detection that effectively separates acoustic fidelity from canonical bias, leading to improved diagnostic accuracy. This work significantly advances the field of MDD by addressing critical methodological challenges and demonstrating state-of-the-art performance across diverse benchmarks, thus paving the way for future research and applications in language learning and speech recognition.
The paper introduces a novel framework, CROTTC-IF, which effectively decouples acoustic fidelity from canonical guidance in Mispronunciation Detection and Diagnosis (MDD). The methodology is well-structured, incorporating a frame-wise acoustic model (CROTTC) that utilizes Optimal Temporal Transport Classification (OTTC) to capture fine-grained mispronunciation cues. Additionally, the Indirect Fusion (IF) strategy allows for implicit knowledge transfer, enhancing the model's performance without relying on explicit canonical prompts. The integration of Consistency Regularization further stabilizes predictions, showcasing a comprehensive approach to addressing the limitations of existing MDD systems.
The experimental evaluation is robust, with the authors conducting extensive tests on multiple datasets, including L2-ARCTIC and Iqra'Eval2. The reported F1-scores of 71.77% and 71.70% demonstrate competitive performance compared to state-of-the-art methods. The paper includes ablation studies that effectively highlight the contributions of different components of the proposed framework, providing a clear understanding of the impact of each method on overall performance.
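For context, MDD detection F1 is conventionally computed over aligned (canonical, realized, predicted) phone triplets; the snippet below sketches that standard scoring, which presumably underlies the reported numbers, though the paper's exact alignment and scoring details may differ.

```python
def mdd_detection_f1(triplets):
    """Standard mispronunciation-detection F1 over aligned phone triplets.

    A mispronunciation is "flagged" when the prediction differs from the
    canonical phone; it truly exists when the realized phone differs from
    the canonical one.
    """
    tr = fr = fa = 0  # true rejections, false rejections, false acceptances
    for canonical, realized, predicted in triplets:
        mispronounced = realized != canonical
        flagged = predicted != canonical
        if mispronounced and flagged:
            tr += 1
        elif not mispronounced and flagged:
            fr += 1
        elif mispronounced and not flagged:
            fa += 1
    precision = tr / (tr + fr) if (tr + fr) else 0.0
    recall = tr / (tr + fa) if (tr + fa) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```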
The paper provides detailed implementation details, including architecture specifications, training protocols, and hyperparameter settings. However, the lack of a publicly accessible code repository limits the reproducibility of the results, as external researchers cannot easily verify or build upon the findings.
While the proposed framework shows promise, the paper does not address potential limitations regarding the generalizability of the model to spontaneous speech or other languages beyond the tested datasets. Additionally, the reliance on specific datasets may introduce biases that could affect the model's applicability in diverse real-world scenarios.
The advancements in MDD presented in this paper have significant implications for various applications, particularly in language learning and automated speech recognition. By improving the accuracy of mispronunciation detection, the framework can enhance educational tools for language learners and contribute to more effective speech therapy solutions.
Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net fully preserves the structural distinction between context and utterance during encoding, effectively abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and seamlessly integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Extensive experiments on mainstream benchmark datasets fully demonstrate the effectiveness of the proposed method.
Primary: Zhejiang University
All Institutions: Nanjing University, Zhejiang University
The main contribution of this paper is the introduction of CUCI-Net, a novel framework for conversational multimodal understanding that effectively preserves the context-utterance structure and utilizes an interpretation cue to guide multimodal reasoning, leading to improved performance in sarcasm detection tasks. This work significantly advances the state of the art in multimodal dialogue understanding by addressing key limitations in existing methodologies.
The proposed CUCI-Net introduces a three-stage framework that emphasizes the preservation of context-utterance structure, the abstraction of context-utterance dependencies into an interpretation cue, and the integration of this cue into multimodal reasoning. This methodology is innovative as it directly addresses the limitations of existing models that often overlook the explicit context-utterance relationship in multimodal dialogue understanding. The use of dual-expert encoders and the structured approach to cue-guided interaction represent a significant advancement in the field.
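The abstraction of the context-utterance dependency into a cue is the least concrete part of this description; one plausible reading, sketched below, is a learned gate that blends pooled utterance-level (local) and context-level (global) evidence into a single conditioning vector. This is an illustrative assumption, not CUCI-Net's actual module.

```python
import torch
import torch.nn as nn

class InterpretationCue(nn.Module):
    """Illustrative gate that fuses pooled utterance (local) and context
    (global) evidence into one cue vector; dimensions are assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_evidence, global_evidence):
        # both inputs: (B, D) pooled representations
        g = self.gate(torch.cat([local_evidence, global_evidence], dim=-1))
        # convex blend; the cue later conditions the final interaction stage
        return g * local_evidence + (1.0 - g) * global_evidence
```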
The experiments conducted on the MUStARD and MUStARD++ datasets demonstrate the effectiveness of CUCI-Net, achieving superior performance compared to various strong baselines. The results are rigorously reported, with metrics such as Precision, Recall, and F1-score, and the ablation studies provide clear insights into the contributions of each component of the model. This thorough evaluation strengthens the claims made regarding the model's effectiveness.
The paper provides detailed implementation details, including architecture specifications, optimization settings, and feature extraction methods. However, the absence of a public code repository or demo URL limits the reproducibility of the results, as others cannot easily replicate the experiments or validate the findings independently.
One notable limitation is the reliance on specific datasets (MUStARD and MUStARD++) that may not fully represent the diversity of conversational contexts in real-world applications. Additionally, while the model excels in sarcasm detection, its performance on other forms of non-literal expressions or more complex conversational dynamics remains to be thoroughly evaluated.
The advancements presented in CUCI-Net have potential applications in various domains, including conversational AI, sentiment analysis, and multimodal interaction systems. By improving context-dependent understanding in dialogue systems, this research can enhance user experiences in virtual assistants, customer service bots, and social robots, contributing to more natural and effective human-computer interactions.
Multimodal Sentiment Analysis (MSA) requires integrating language, acoustic, and visual signals without sacrificing modality-specific sentiment evidence. Existing methods mainly improve either shared-private decomposition or cross-modal interaction. Although effective, both ultimately depend on how shared and modality-specific evidence is organized before prediction. We observe that, under standard shared-private pipelines, modality heterogeneity often induces a branch-imbalance process: dominant shared patterns accumulate in the shared branch, yielding redundant and modality-biased evidence, while repeated interaction and rigid alignment gradually leak shared information into modality-specific channels and weaken discriminative private representations. As a result, the complementarity between shared and private representations is reduced, limiting robust sentiment reasoning. To address this issue, we propose the Dual-Branch Rebalancing Framework (DBR) on top of a standard multimodal decoupling stage. In the shared branch, a Temporal-Structural Factorization (TSF) module disentangles temporal evolution from structural dependencies and adaptively integrates them to reduce shared redundancy. In the private branch, an Anchor-Guided Private Routing (AGPR) module preserves discriminative modality-specific patterns while allowing controlled cross-modal borrowing. A Bidirectional Rebalancing Fusion (BRF) module then reunifies the two regularized branches in a context-aware manner for final prediction. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that DBR consistently outperforms the compared baselines. Further analyses show that these improvements come from coordinated mitigation of branch imbalance.
Primary: Fudan University
All Institutions: China University of Petroleum-Beijing at Karamay, Fudan University, Peking University, University of Southern California, University of Macau
The paper presents a comprehensive framework for addressing shared-private branch imbalance in multimodal sentiment analysis, contributing valuable insights and methodologies to the field. The innovative approach and rigorous experimental validation position this work as a significant advancement in multimodal representation learning.
The proposed Dual-Branch Rebalancing Framework (DBR) introduces a novel approach to mitigating shared-private branch imbalance in multimodal sentiment analysis. The methodology is well-structured, comprising three main components: Temporal-Structural Factorization (TSF) to disentangle shared representations, Anchor-Guided Private Routing (AGPR) to maintain modality-specific features, and Bidirectional Rebalancing Fusion (BRF) for effective integration. This coordinated design addresses the inherent challenges of modality heterogeneity and redundancy, showcasing a clear understanding of the complexities involved in multimodal representation learning.
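As background for the shared-private decoupling stage that DBR builds on, the sketch below shows a minimal shared/private projection with a soft orthogonality penalty; the TSF, AGPR, and BRF modules themselves are not reproduced here, and the layer sizes are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateDecoupler(nn.Module):
    """Minimal shared/private projection with a soft orthogonality penalty,
    standing in for the decoupling stage DBR starts from; sizes are assumed."""

    def __init__(self, dim):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.private = nn.Linear(dim, dim)

    def forward(self, x):
        s, p = self.shared(x), self.private(x)
        # penalize overlap so shared patterns do not leak into the private branch
        ortho = (F.normalize(s, dim=-1) * F.normalize(p, dim=-1)).sum(-1).pow(2).mean()
        return s, p, ortho
```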
The experimental evaluation is robust, utilizing multiple widely recognized benchmarks (CMU-MOSI, CMU-MOSEI, and MIntRec) to validate the effectiveness of DBR. The results demonstrate significant improvements over state-of-the-art baselines across various metrics, indicating the proposed framework's strong performance. The ablation studies further substantiate the contributions of each module, providing insights into their individual impacts on overall performance.
The paper provides sufficient implementation details, including the use of PyTorch, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository or demo limits the practical reproducibility of the results.
While the framework shows promising results, the paper does not address potential limitations such as the scalability of the model to larger datasets or the computational efficiency of the proposed modules. Additionally, the reliance on specific benchmarks may not fully capture the generalizability of the approach across diverse multimodal tasks.
The findings of this research have significant implications for the field of multimodal sentiment analysis, particularly in applications involving human-centered AI systems. By improving the integration of diverse modalities, the proposed framework can enhance the robustness of sentiment prediction in real-world scenarios, potentially benefiting areas like social media analysis, customer feedback interpretation, and emotional AI.
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents a novel accent evaluation benchmark for Indic TTS systems, offering a detailed and interpretable framework that enhances the understanding of accent fidelity in synthesized speech. The innovative methodology and significant findings position this work as a valuable contribution to the field of machine learning and speech synthesis.
The paper proposes a novel framework, the Phoneme Substitution Profile (PSP), which quantitatively evaluates accent fidelity in Indic languages for TTS systems. The methodology is robust, utilizing a combination of acoustic probes and distributional metrics to capture phonological dimensions of accent. The pairing of forced alignment with native-speaker centroid probes over Wav2Vec2-XLS-R embeddings is particularly innovative, allowing for a detailed analysis of accent features that are often overlooked in traditional TTS evaluations. The six dimensions of accent fidelity (RR, AF, LF, ZF, FAD, PSD) provide a comprehensive approach to understanding TTS performance across different systems and languages.
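To illustrate how a centroid-probe dimension such as the retroflex collapse rate (RR) can be computed, the snippet below scores each forced-aligned retroflex segment by whether its pooled Wav2Vec2-XLS-R layer-9 embedding lies closer to a native dental centroid than to the retroflex centroid; the pooling and centroid construction here are assumptions rather than the released scoring code.

```python
import numpy as np

def retroflex_collapse_rate(segment_embs, retroflex_centroid, dental_centroid):
    """Fraction of retroflex targets whose pooled embedding is closer to the
    native dental centroid than to the native retroflex centroid.

    segment_embs: list of pooled layer-9 embeddings, one per forced-aligned
    retroflex phone in the synthesized audio.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    collapsed = sum(
        cos(e, dental_centroid) > cos(e, retroflex_centroid) for e in segment_embs
    )
    return collapsed / max(len(segment_embs), 1)
```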
The experiments benchmark four commercial and open-source TTS systems across Hindi, Telugu, and Tamil, showcasing the effectiveness of the PSP framework. The findings reveal significant insights into the performance of these systems, particularly the divergence between traditional intelligibility metrics (WER) and the proposed accent fidelity metrics. The detailed analysis of results across different languages highlights the varying challenges posed by phonological complexity, making the evaluation both thorough and insightful.
The authors have made a commendable effort to ensure reproducibility by releasing the scoring code and native speaker centroids under open-source licenses. However, the reliance on specific aligners and the current limitations in the quality of these tools may affect the reproducibility of results, particularly for Telugu and Tamil. Future versions of the benchmark are expected to address these issues, enhancing the overall reproducibility.
The paper acknowledges several limitations, including the dependency on forced alignment accuracy, which varies by language, and the potential noise floor in per-phoneme scores. The authors also note that the current version of the PSP does not include formal MOS calibration, which is essential for validating the proposed metrics against human judgment. Additionally, the limited size of pilot sets may affect the statistical significance of some findings.
The PSP framework has the potential to significantly impact the development of TTS systems for Indic languages, providing a much-needed tool for developers to optimize accent fidelity. By focusing on specific phonological features, the framework can help improve the naturalness and intelligibility of synthesized speech, making it more accessible to native speakers. This work also opens avenues for further research into accent evaluation in other languages and dialects, contributing to the broader field of speech synthesis.
Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.
Primary: StepFun
All Institutions: StepFun, Nanyang Technological University, University of New South Wales, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of Step-Audio-R1.5, a novel audio reasoning model that integrates RLHF to enhance the quality of multi-turn dialogues, addressing the limitations of existing models that prioritize isolated correctness over conversational naturalness. This work represents a significant step forward in developing more empathetic and engaging audio interaction systems, setting a new standard for future research in audio language models.
The methodology is robust and innovative, introducing a new paradigm in audio language models by integrating Reinforcement Learning from Human Feedback (RLHF) to address the limitations of Reinforcement Learning with Verified Rewards (RLVR). The paper effectively outlines a structured approach that includes a mid-training stage, cold-start supervised fine-tuning, and a novel reward model that captures both explicit and implicit quality metrics. This combination is significant as it aims to enhance the naturalness and emotional engagement of audio interactions, which is a critical aspect often overlooked in traditional models.
The experimental evaluation is comprehensive, utilizing multiple benchmarks, including the newly proposed AudioMultiChallenge and Step-Caption, which are well-designed to assess various dimensions of audio reasoning and dialogue quality. The results indicate that Step-Audio-R1.5 performs competitively against leading models, demonstrating significant improvements in multi-turn dialogue scenarios. The use of diverse datasets and rigorous evaluation metrics strengthens the findings.
The paper provides a clear description of the architecture and training process, which aids in reproducibility. However, it lacks detailed implementation specifics such as hyperparameters and training duration, which are essential for fully replicating the experiments. The availability of the project URL is a positive aspect, as it may contain additional resources for implementation.
One limitation is the potential over-reliance on human feedback, which may introduce biases based on the evaluators' preferences. Additionally, while the model shows improvements in conversational quality, the paper does not extensively discuss how it handles edge cases or unexpected user inputs, which are common in real-world applications.
The proposed model has the potential to significantly advance the field of audio language processing by improving user interactions in conversational AI systems. This could lead to more engaging and emotionally aware audio applications in various domains, including virtual assistants, customer service, and entertainment.
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
Primary: Central Conservatory of Music
All Institutions: Frontier Institute of Science and Technology, Central Conservatory of Music, Department of AI Music and Music Information Technology, Shenzhen University, Interdisciplinary Research Center
The main contribution of this paper is the development of SymphonyGen, a novel 3D hierarchical framework for orchestral music generation that effectively addresses the complexities of high-level structural form and dense orchestration. This work represents a substantial advancement in the field of AI music generation, combining innovative methodologies with rigorous evaluation to produce a system that aligns closely with modern acoustic expectations.
The paper introduces a 3D hierarchical architecture that effectively manages the complexities of orchestral music generation by decomposing the task into Bar, Track, and Event levels. This cascading decoder architecture enhances computational efficiency and scalability, which is a significant improvement over conventional models. The introduction of a "short-score" conditioning via a beat-quantized multi-voice harmony skeleton is innovative, allowing for greater control over the generated music while maintaining textural diversity. The use of Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward is a novel approach that aligns the generated symbolic output with acoustic expectations, addressing the limitations of previous models. The dissonance-averse sampling algorithm further refines the output by suppressing unintended tonal clashes, showcasing a thoughtful integration of music theory into the generative process.
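Dissonance-averse sampling can be pictured as a constrained decoding step: before the next pitch token is sampled, candidates that would clash with currently sounding pitches are suppressed. The sketch below is illustrative only; the interval set, the assumption that pitch tokens map to MIDI numbers, and the hard masking rule are not SymphonyGen's exact algorithm.

```python
import torch

CLASH_INTERVALS = {1, 6, 11}  # minor 2nd, tritone, major 7th (illustrative choice)

def dissonance_averse_sample(pitch_logits, sounding_pitches, temperature=1.0):
    """Suppress pitch candidates that form harsh intervals with notes already
    sounding, then sample; assumes pitch token ids equal MIDI pitch numbers."""
    logits = pitch_logits.clone() / temperature
    for candidate in range(logits.shape[-1]):
        if any(abs(candidate - p) % 12 in CLASH_INTERVALS for p in sounding_pitches):
            logits[candidate] = float("-inf")
    if torch.isinf(logits).all():                 # nothing left: fall back to unmasked
        logits = pitch_logits / temperature
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```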
The experimental design is robust, featuring both objective and subjective evaluations. The use of a large dataset (SymphonyNet) for training and validation ensures that the model is well-tested across various orchestral styles. Objective metrics such as harmony precision, recall, and dissonance scores provide quantitative assessments of the model's performance, while subjective evaluations involving listener preferences add qualitative insights. The results indicate that SymphonyGen outperforms baseline models in terms of musicality and preference, particularly among general listeners, which is a strong endorsement of its effectiveness.
The paper provides detailed implementation information, including architecture specifications, training procedures, and evaluation metrics. However, the absence of a publicly available code repository limits reproducibility. The authors mention that implementation details will be available in their codebase, but without immediate access, it is challenging to fully assess reproducibility.
The paper acknowledges some limitations, such as the potential for "strange" harmonies or "noisy" segments in the generated music, which may stem from errors in harmony skeleton generation. Additionally, the subjective evaluations indicate that while the model performs well, it may still produce overly full orchestrations at times, suggesting room for improvement in balancing orchestration richness with clarity.
SymphonyGen has significant implications for the field of AI-assisted music composition, particularly in cinematic orchestration. By providing a controllable framework for composers, it enhances the collaborative potential between human creativity and AI-generated music. The model's ability to produce high-quality orchestral compositions could benefit various applications, including film scoring, video game music, and other multimedia projects, ultimately enriching the landscape of contemporary music creation.
Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
Primary: National Taiwan University
All Institutions: National Taiwan University, Artificial Intelligence Center of Research Excellence (AI-CoRE)
This paper makes a significant contribution by systematically evaluating uncertainty estimation methods in audio-aware large language models, revealing critical insights that could guide future research and applications in multimodal AI systems. The comprehensive benchmarking and analysis of methods provide a valuable foundation for improving the reliability of ALLMs in practical scenarios.
The paper presents a systematic empirical study of uncertainty estimation methods tailored for audio-aware large language models (ALLMs). It benchmarks five distinct methods, including predictive entropy and semantic entropy, across various models and tasks, highlighting the unique challenges posed by audio inputs. The methodology is sound, employing a two-stage protocol for uncertainty estimation and a clear comparative analysis across multiple benchmarks. However, the reliance on existing methods from text-based LLMs without significant adaptation for audio-specific challenges could be seen as a limitation.
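For reference, the benchmarked estimators have simple Monte Carlo forms; the snippet below sketches predictive entropy, length-normalized entropy, and discrete semantic entropy from sampled answers, following their standard definitions (grouping answers into meaning clusters, e.g., with an entailment model, is assumed to happen separately).

```python
import numpy as np

def predictive_entropy(seq_logprobs):
    """Monte Carlo estimate: PE ~ -(1/N) * sum_i log p(s_i | x), where
    seq_logprobs[i] is the total log-probability of the i-th sampled answer."""
    return -float(np.mean(seq_logprobs))

def length_normalized_entropy(seq_logprobs, seq_lengths):
    """Same estimate, but each sequence log-probability is divided by its
    token length so long answers are not penalized simply for length."""
    normed = [lp / max(n, 1) for lp, n in zip(seq_logprobs, seq_lengths)]
    return -float(np.mean(normed))

def discrete_semantic_entropy(cluster_ids):
    """Entropy over the empirical distribution of meaning clusters that the
    sampled answers were grouped into."""
    _, counts = np.unique(cluster_ids, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())
```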
The experiments are comprehensive, covering a wide range of benchmarks that assess both general audio understanding and trustworthiness-oriented tasks. The results indicate that semantic-level and verification-based methods consistently outperform token-level baselines, providing valuable insights into the performance of uncertainty estimation in ALLMs. The evaluation metrics, including AUROC and AURAC, are appropriate for the tasks at hand.
While the paper provides a detailed description of the experimental setup, including the models used and the evaluation protocols, it lacks specific implementation details or code availability, which could hinder reproducibility. The absence of a project URL further complicates this aspect.
The study primarily focuses on constrained answer spaces, which may not generalize well to open-ended tasks. Additionally, the uncertainty estimation methods are largely inherited from text LLM literature, potentially limiting their effectiveness in capturing audio-specific uncertainties. The fixed threshold for adaptive inference may not be optimal across all scenarios, and the study does not explore more sophisticated routing strategies.
The findings have significant implications for the development of more reliable audio-language systems, particularly in applications requiring robust uncertainty estimation for decision-making. The work lays a foundation for future research in uncertainty-aware models, which could enhance the safety and reliability of AI systems in high-stakes environments.
To establish empathy with machines, it is essential to fully understand how human emotions change. However, research in multimodal emotion recognition often overlooks one problem: expressive traits vary substantially across individuals, so different people may express the same emotion in different ways. We see this in daily life: some people convey "happiness" through their facial expressions and words, while others hide it or reveal it only through their actions. Both are expressions of "happiness," yet such differences in emotional expression remain difficult for machines to distinguish. Current emotion recognition is still "static," relying on a single model to identify all expressive styles, and this simplification often degrades recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker Adaptive Network (ML-SAN) that tackles the confusion introduced by speaker identity information. Rather than simply attaching a speaker ID after recognition, ML-SAN employs a three-stage adaptive process: first, Input-level Calibration uses Feature-wise Linear Modulation (FiLM) to map the raw audio and visual features into a speaker-neutral space; then, Interaction-level Gating re-weights the trust placed in each modality (e.g., voice or facial features) based on the speaker's identity information; finally, Output-level Regularization maintains the consistency of speaker features in the latent space. Tests on the MELD and IEMOCAP datasets show that ML-SAN achieves better results, performs especially well on challenging tail emotion categories, and better handles the diversity of speakers in real-world scenarios.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The main contribution of this paper is the introduction of the Multi-Level Speaker-Adaptive Network (ML-SAN), which effectively addresses speaker heterogeneity in multimodal emotion recognition through a novel three-stage adaptive process. This work represents a significant advancement in the field of emotion recognition by integrating speaker identity into the modeling process, thereby improving the accuracy and robustness of emotion detection in conversations.
The proposed ML-SAN framework introduces a three-stage adaptive process that effectively addresses the challenges of speaker identity confusion in emotion recognition. The use of Feature-wise Linear Modulation (FiLM) for input calibration, dynamic gating for interaction-level adjustments, and output regularization to maintain speaker identity showcases a thoughtful and innovative approach to handling multimodal data. This hierarchical adaptation strategy is a significant advancement over traditional speaker-agnostic methods, as it actively incorporates speaker characteristics into the model's decision-making process.
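The input-level calibration stage is essentially FiLM conditioning on a speaker embedding; the sketch below shows that idea in PyTorch, with dimensions and the conditioning network chosen for illustration rather than taken from the paper.

```python
import torch.nn as nn

class SpeakerFiLMCalibration(nn.Module):
    """Illustrative FiLM calibration: a speaker embedding predicts per-channel
    scale (gamma) and shift (beta) that modulate the raw audio/visual features,
    pushing them toward a speaker-neutral space."""

    def __init__(self, feat_dim, spk_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(spk_dim, 2 * feat_dim)

    def forward(self, features, speaker_emb):
        # features: (B, T, feat_dim); speaker_emb: (B, spk_dim)
        gamma, beta = self.to_gamma_beta(speaker_emb).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)
```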
The experiments conducted on the MELD and IEMOCAP datasets demonstrate the effectiveness of the ML-SAN model, achieving superior performance compared to the baseline MultiEMO. The rigorous evaluation, including ablation studies to analyze the contribution of each component, adds credibility to the findings. The reported metrics, such as the weighted F1-score, indicate that the model performs well, particularly in challenging scenarios involving diverse emotional expressions.
The paper provides sufficient details regarding the experimental setup, including the use of specific datasets and the implementation of baseline models under identical conditions. However, the absence of a publicly accessible code repository limits the reproducibility of the results. Future work should consider making the code available to facilitate further research and validation.
While the ML-SAN model shows promising results, the paper acknowledges potential challenges in real-world applications, such as background noise and missing modalities. Additionally, the model's reliance on specific datasets may limit its generalizability to other contexts or languages. The authors should address these limitations in future iterations of their work.
The ability to accurately recognize emotions in conversations has significant implications for the development of empathetic AI systems. This research could enhance human-computer interaction in various applications, including virtual assistants, mental health support, and customer service. By improving emotion recognition, ML-SAN can contribute to more nuanced and effective communication between humans and machines.
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents a novel approach to adapting a frozen multilingual TTS model for Indic languages, demonstrating competitive performance against commercial systems while requiring minimal training data. The combination of BUPS, LoRA adaptation, and voice-prompt recovery represents a significant advancement in TTS technology, particularly for low-resource languages.
The methodology presented in the paper combines three innovative components: the Brahmic Unified Phoneme Space (BUPS) for romanisation of Indic scripts, a low-rank adaptation (LoRA) approach for the text-token predictor, and a voice-prompt recovery recipe that enhances acoustic output without retraining the acoustic decoder. This combination allows for effective adaptation of a frozen multilingual TTS model to support Indic languages, which is a significant advancement in TTS technology for low-resource languages. The approach is well-structured, addressing specific challenges in TTS for Indic languages and demonstrating a clear understanding of the limitations of existing systems.
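Of the sampling overrides in the abstract's "Config B", min_p is the most generic and can be sketched directly: after temperature scaling, tokens whose probability falls below min_p times the top token's probability are dropped before sampling. The exaggeration knob is a Chatterbox-specific control not modeled here, and this is an illustrative implementation, not the project's inference code.

```python
import torch

def min_p_sample(logits, temperature=0.6, min_p=0.1):
    """min_p sampling: keep only tokens whose probability is at least
    min_p times the probability of the most likely token, then sample."""
    probs = torch.softmax(logits / temperature, dim=-1)
    threshold = min_p * probs.max()
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```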
The experimental evaluation is robust, utilizing a companion benchmark for assessing phonological accuracy and intelligibility across three Indic languages. The results indicate that the proposed system performs competitively against commercial baselines, particularly in terms of retroflex collapse and other phonological metrics. The use of a 10-utterance pilot set allows for initial validation, although the small sample size may limit statistical significance. The paper effectively communicates the results, providing detailed comparisons with existing systems.
The authors have made significant efforts to ensure reproducibility by releasing the LoRA weights, inference code, and a demo interface. However, the reliance on specific datasets and the complexity of the methods may pose challenges for complete replication without access to the same resources. The paper includes sufficient detail on the methodology and experimental setup to allow for independent verification of results.
The paper acknowledges several limitations, including the small sample size for pilot evaluations, the lack of formal Mean Opinion Score (MOS) testing, and the challenges faced in adapting the acoustic decoder. Additionally, the performance on Hindi with the LoRA adapter regressed accuracy, indicating that the method's effectiveness may vary across languages. The authors also note that the current implementation relies on reference audio clips, which may limit flexibility in practical applications.
This research has the potential to significantly impact the development of TTS systems for low-resource languages, particularly in India, where many languages are underrepresented in commercial TTS solutions. By providing a method that requires minimal training data and computational resources, the work could democratize access to high-quality TTS technology for Indic languages, fostering greater inclusivity in technology. The open-source release of the model and code further enhances its potential for widespread adoption and further research.
This performance presents a duet between two intelligent musical instruments, Sù (to trace back; to go upstream) and Agentier (playing on agentic clavier), and their human performers, connected through feedback loops. Rather than treating AI as a tool that responds predictably to input, both systems operate recursively, where past actions continuously influence future behaviour. The Sù operates in the audio space through latent representation. Its performer uses Make Noise 0-series synthesisers and MIDI controllers to work with a neural feedback synthesis system based on a RAVE model, with a latent feedback loop embedded within the model's internal structure. This allows the instrument to remember and reuse its own internal states, influencing ongoing sound generation through its recent sonic history. The Agentier functions in the control space. Its performer interacts with the system using a Roland S-1 synthesiser and Keith McMillen QuNeo touchpad, where control gestures are routed into a recurrent neural network that feeds back into the synthesis process. Through this feedback loop, the system actively shapes the evolution of control signals over time. Contrasting feedback in the audio and control domains, the performance explores shared agency, resistance, and negotiation between humans and intelligent musical systems. Musical phenomena are co-produced through the entangled states of interaction, rather than through pre-existing system configuration or fixed mappings.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a significant contribution to the field of AI in music by exploring the co-constructive relationship between human performers and intelligent musical instruments through innovative feedback mechanisms. The methodology is well-defined, though the lack of rigorous experimental evaluation and reproducibility details limits its impact.
The paper presents a novel approach to musical performance through the integration of AI in two intelligent musical instruments, Sù and Agentier. The methodology is well-articulated, detailing the use of a RAVE model for audio synthesis and a recurrent neural network for control signal generation. The recursive feedback mechanisms employed in both instruments are innovative, allowing for a dynamic interaction between the performer and the instrument, which enhances the creative process. The use of latent representations and direct manipulation of latent dimensions is particularly noteworthy, as it provides performers with greater control over the sonic output.
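Abstracting away the specific hardware, the latent feedback loop in Sù can be pictured as blending the freshly encoded latent with the instrument's previous latent state before decoding; the sketch below is a schematic of that loop, with `encoder` and `decoder` standing in for the RAVE model's halves and the blend rule chosen purely for illustration.

```python
def latent_feedback_step(encoder, decoder, audio_in, prev_latent, feedback=0.5):
    """Schematic latent feedback around a RAVE-style autoencoder: the latent
    driving the decoder mixes the live encoder output with the instrument's
    previous latent state, so recent sonic history keeps shaping the sound."""
    z_now = encoder(audio_in)
    z_fed = feedback * prev_latent + (1.0 - feedback) * z_now
    audio_out = decoder(z_fed)
    return audio_out, z_fed  # z_fed becomes prev_latent on the next audio buffer
```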
While the paper describes the performance setup and the interaction between the instruments, it lacks a comprehensive experimental evaluation with quantitative metrics. The authors mention a video documentation of a performance, which serves as a qualitative demonstration of their approach. However, there is no detailed analysis of the performance outcomes, such as audience reception or systematic comparisons with traditional instruments or other AI-enabled systems. Including metrics like Mean Opinion Score (MOS) or other objective evaluations would strengthen the claims made.
The paper provides a clear description of the instruments and the technology used, which aids in reproducibility. However, specific implementation details, such as the exact configurations of the neural networks and the training datasets, are not sufficiently detailed. Additionally, the lack of a publicly available code repository limits the ability of other researchers to replicate the work fully.
One of the main limitations is the absence of a rigorous experimental evaluation framework to assess the performance of the instruments quantitatively. The reliance on qualitative descriptions and a single performance video may not provide a comprehensive understanding of the instruments' capabilities. Furthermore, the paper does not address potential issues related to latency in real-time performance, which could affect the interaction quality between the performer and the AI systems.
The integration of AI in musical performance has significant implications for the future of music creation and performance. This work encourages a rethinking of the role of the performer and the instrument, promoting a collaborative relationship that could lead to new forms of musical expression. The exploration of feedback loops and shared agency could inspire further research in both music technology and human-computer interaction, potentially influencing the design of future intelligent musical instruments.
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
Primary: National Taiwan University
All Institutions: National Taiwan University, NTU Artificial Intelligence Center of Research Excellence
The paper presents a critical analysis of the reliance on audio in audio-language models, challenging existing benchmarks and proposing a framework for better evaluation. The methodology and findings are significant, offering valuable insights for researchers and practitioners in the field of machine learning and audio understanding.
The paper introduces a novel diagnostic framework that assesses audio-language models (LALMs) based on two axes: text prior and audio reliance. This dual-axis approach allows for a nuanced understanding of how much of a model's performance can be attributed to textual cues versus actual audio processing. The methodology is well-structured, employing controlled settings to quantify the text prior and audio reliance, which is a significant advancement in evaluating LALMs. The use of multiple benchmarks and a variety of models strengthens the robustness of the findings.
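As a rough illustration of the two axes, the sketch below scores a benchmark twice, once with the audio withheld and once with it supplied, and reports text-prior and audio-reliance quantities. The `ask_model` callable and the item schema are placeholders, and the paper's exact prompting and normalization protocol are not reproduced.

```python
from typing import Callable, Sequence

def text_prior(items: Sequence[dict], ask_model: Callable) -> float:
    """Fraction of items answered correctly from the question text alone (no audio)."""
    correct = sum(ask_model(it["question"], audio=None) == it["answer"] for it in items)
    return correct / len(items)

def audio_reliance(items: Sequence[dict], ask_model: Callable) -> float:
    """Share of the full-audio score that actually depends on hearing the audio."""
    with_audio = sum(ask_model(it["question"], audio=it["audio"]) == it["answer"] for it in items)
    text_only = sum(ask_model(it["question"], audio=None) == it["answer"] for it in items)
    return (with_audio - text_only) / max(with_audio, 1)
```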
The experiments are thorough, evaluating eight LALMs across three distinct benchmarks. The results indicate a substantial grounding gap, revealing that models can achieve high scores without audio input, which challenges the assumption of robust auditory understanding. The analysis of performance retention with partial audio is particularly insightful, providing a clear picture of how audio information is utilized by the models. However, the paper could benefit from more detailed statistical analysis to support its claims.
The paper provides a clear description of the experimental setup, including the models used and the evaluation protocols. However, it lacks specific URLs or repositories for code and data, which could hinder reproducibility. Including such resources would enhance the paper's impact and facilitate further research in this area.
One limitation is the reliance on existing benchmarks, which may not fully capture the complexities of audio understanding. Additionally, while the study identifies issues with current benchmarks, it does not propose new benchmarks or datasets, which could be a missed opportunity for advancing the field. The findings may also be limited by the specific models and benchmarks chosen for evaluation.
The findings have significant implications for the design of future audio-language benchmarks and the evaluation of LALMs. By highlighting the potential for models to rely on textual priors rather than genuine auditory understanding, the paper calls for a reevaluation of how auditory capabilities are assessed in machine learning. This could lead to more accurate and reliable evaluations, ultimately improving the development of models that genuinely understand audio.
Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are underrepresented in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.
Primary: Seoul National University
All Institutions: Seoul National University
This paper presents a significant advancement in automatic chord recognition through a novel segment-level sequence modeling approach, effectively addressing oversegmentation and data imbalance challenges. The methodology is well-structured, and the experimental results demonstrate substantial improvements, marking a meaningful contribution to the field of music information retrieval.
The paper introduces a novel segment-level sequence-to-sequence approach for automatic chord recognition (ACR), effectively addressing oversegmentation and data imbalance issues prevalent in traditional frame-level methods. The use of a Transformer encoder-decoder architecture is well-justified, and the introduction of two token representations (MERGE and SPLIT) demonstrates a thoughtful approach to chord modeling. The encoder pre-training method based on chord similarity is innovative and enhances the model's ability to generalize, particularly for complex chord types.
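The segment-level idea can be illustrated with a toy tokenizer that emits one chord token and one duration token per segment instead of one label per frame. The actual MERGE/SPLIT vocabularies and beat quantization used in the paper are not reproduced here.

```python
def segments_to_tokens(segments):
    """segments: list of (chord_label, duration_in_beats) pairs from an aligned score."""
    tokens = []
    for chord, beats in segments:
        tokens.append(f"<chord:{chord}>")   # what the harmony is
        tokens.append(f"<dur:{beats}>")     # how long it lasts, in beats
    return tokens

print(segments_to_tokens([("C:maj", 4), ("A:min7", 2), ("G:7", 2)]))
# ['<chord:C:maj>', '<dur:4>', '<chord:A:min7>', '<dur:2>', '<chord:G:7>', '<dur:2>']
```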
The experiments are comprehensive, utilizing a well-defined dataset of 471 pop songs with manual annotations. The use of 5-fold cross-validation strengthens the reliability of the results. The reported improvements in both chord recognition and segmentation metrics, particularly for complex chords, are significant and demonstrate the effectiveness of the proposed methods. The ablation studies provide clear insights into the contributions of each component of the model.
The paper includes sufficient implementation details, such as data preprocessing, model architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The availability of the code repository enhances this aspect, allowing other researchers to replicate the results and build upon this work.
While the paper addresses several critical challenges in ACR, it does not discuss the potential limitations of the proposed methods, such as the reliance on the quality of the training dataset or the challenges in generalizing to genres or styles not represented in the dataset. Additionally, the model's performance on real-world recordings versus studio recordings could be explored further.
The advancements in chord recognition could have significant implications for music information retrieval, music education, and automated music composition systems. By improving the recognition of complex chords, this work could enhance tools for musicians and composers, making music analysis more accessible and efficient.
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing
The paper presents a significant advancement in automatic speech recognition by introducing an abstention-aware framework and a novel reliability metric, RAS, which enhances the reliability of ASR outputs in uncertain conditions. The methodology is well-founded and the experimental results robustly support the proposed contributions, marking a meaningful step forward in the field of speech processing.
The paper introduces a novel abstention-aware transcription framework for ASR systems, which allows models to abstain from uncertain segments rather than producing potentially misleading transcriptions. The proposed Reliability-Aware Score (RAS) metric is innovative as it integrates a placeholder for uncertainty directly into the transcription process, moving beyond traditional metrics like Word Error Rate (WER). The methodology is well-structured, employing a two-stage training pipeline that combines supervised bootstrapping and reinforcement learning, effectively enhancing the model's reliability in challenging acoustic conditions.
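The paper's RAS formula is not given here, but a reliability-oriented score of this general shape (credit for committed correct words, a tunable penalty for errors, neutrality toward abstained spans) captures the stated trade-off. The expression below is an assumption for illustration only, not the authors' metric.

```python
def reliability_score(n_correct: int, n_errors: int, n_abstained: int, lam: float = 2.0) -> float:
    """Toy reliability score: reward correct committed words, penalize errors with weight lam."""
    total = n_correct + n_errors + n_abstained
    if total == 0:
        return 0.0
    informativeness = n_correct / total    # credit for committed, correct words
    error_penalty = n_errors / total       # misleading output is worse than silence
    return informativeness - lam * error_penalty

print(reliability_score(n_correct=90, n_errors=5, n_abstained=5))   # 0.8
print(reliability_score(n_correct=95, n_errors=5, n_abstained=0))   # 0.85
```

Under this kind of score, abstaining on genuinely uncertain words can beat committing to them, which is exactly the behaviour the trade-off parameter is meant to calibrate against human preference.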
The experiments are comprehensive, utilizing two datasets (LibriSpeech and TALCS) to evaluate the proposed method under both clean and noisy conditions. The results demonstrate significant improvements in transcription reliability, particularly in adverse environments, while maintaining competitive accuracy. The use of human preference alignment for calibrating the RAS metric adds robustness to the evaluation process, ensuring that the proposed framework is grounded in real-world applicability.
The paper provides detailed descriptions of the methodology, including the training pipeline and experimental setup. However, there is a lack of supplementary material or code repositories that would facilitate complete reproducibility. The absence of a project URL limits the ability for other researchers to replicate the findings directly.
While the proposed framework shows promise, the reliance on human preference data for calibrating the RAS metric may introduce biases based on the specific population sampled. Additionally, the performance in highly diverse acoustic environments beyond those tested (e.g., different languages or dialects) remains unaddressed, which could limit the generalizability of the findings.
The approach has significant implications for high-stakes applications of ASR, such as medical and legal transcription, where reliability is critical. By providing a mechanism for models to indicate uncertainty, the framework can enhance user trust and improve decision-making processes in various domains. The introduction of RAS as a new evaluation metric could also pave the way for further research into reliable ASR systems.
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN Audio A/S
The main contribution of this paper is the introduction of DriftSE, a novel generative framework for speech enhancement that reformulates denoising as an equilibrium problem, achieving high-fidelity results in a single inference step. This work represents a significant advancement in the field of speech enhancement, combining innovative methodology with robust experimental validation to address critical challenges in real-time applications.
The proposed method, DriftSE, innovatively formulates speech enhancement as an equilibrium problem, leveraging a learned Drifting Field for one-step inference. This approach diverges from traditional iterative sampling techniques, providing a significant computational advantage. The use of a semantic latent space for drift computation enhances the model's ability to capture complex speech structures, which is a notable improvement over existing methods. The dual formulation of the model—direct mapping and conditional generation—adds flexibility and robustness to the framework, allowing it to adapt to various scenarios, including unpaired training.
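To contrast with multi-step diffusion sampling, here is a schematic one-step enhancer: a single forward pass with a learned correction added in a latent space. The module names, sizes, and the placement of the drift term are assumptions, not the DriftSE architecture.

```python
import torch
import torch.nn as nn

class OneStepEnhancer(nn.Module):
    """Single-pass enhancer sketch: encode, apply a learned latent correction, decode."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.drift = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_feats: torch.Tensor) -> torch.Tensor:
        z = self.encoder(noisy_feats)   # push the noisy observation into latent space
        z = z + self.drift(z)           # learned correction toward the clean distribution
        return self.decoder(z)          # one decode, no iterative sampling

x = torch.randn(1, 100, 256)            # (batch, frames, features), toy input
print(OneStepEnhancer()(x).shape)        # torch.Size([1, 100, 256])
```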
The experiments conducted on the VoiceBank-DEMAND benchmark and the DNS Challenge 2020 blind test set showcase the effectiveness of DriftSE in achieving high-fidelity speech enhancement. The reported metrics (PESQ, SI-SDR, SCOREQ) indicate that DriftSE outperforms both multi-step diffusion models and other one-step approaches, establishing its competitive edge. The thorough evaluation across different datasets and conditions demonstrates the model's generalization capabilities, which is crucial for real-world applications.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and hyperparameter settings, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the accessibility of the method for further validation by the research community.
While the DriftSE framework shows promising results, its reliance on a pre-trained self-supervised learning encoder may introduce limitations related to the quality and representativeness of the latent features. Additionally, the performance drop in unpaired settings suggests that the model may struggle in scenarios where clean-reference data is not available, highlighting a potential area for improvement.
The DriftSE framework has significant implications for real-time speech enhancement applications, particularly in environments with varying noise conditions. Its ability to perform one-step inference could facilitate deployment in low-latency scenarios, such as telecommunication and assistive technologies. Furthermore, the methodology could inspire future research in generative modeling and distribution matching across other domains beyond audio.
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences, a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framework that decomposes this complex task through specialized multi-agent orchestration. Our framework employs three key innovations: (1) a multi-agent narrative synthesis module where specialized LLM agents collaboratively generate comprehensive cinematic blueprints with character profiles, scene descriptions, and cross-modal specifications; (2) a decoupled character-centric pipeline that maintains identity consistency through instance-level tracking and integration while enabling flexible multi-character composition; and (3) a hierarchical audio-visual synchronization mechanism ensuring frame-level alignment of dialogue, expressions, and music. Extensive experiments demonstrate that CineAGI achieves 40% improvement in overall consistency, 4.4% gain in subject consistency, 5.4% enhancement in aesthetic quality, and 28.7% higher character consistency compared to baselines. Our work establishes a principled foundation for automated multi-scene video generation that preserves narrative coherence and character authenticity.
Primary: Nanjing University
All Institutions: Nanjing University, Zhejiang Sci-Tech University, University of British Columbia, Beijing Shuzhimei Technology Co., Ltd, Jilin University, Tianjin University
CineAGI represents a significant advancement in automated movie creation through its innovative multi-agent orchestration framework. The comprehensive methodology and substantial experimental validation establish it as a leading approach in the field, with the potential to reshape how narratives are crafted in digital media.
The methodology presented in CineAGI is robust and innovative, leveraging a hierarchical multi-agent orchestration approach to tackle the complex task of automated movie creation. The use of specialized LLM agents for narrative synthesis, character generation, and cinematographic synthesis is a significant advancement over traditional end-to-end models. The framework's ability to maintain character consistency and narrative coherence across scenes through decoupled processing and explicit synchronization mechanisms is particularly noteworthy. The detailed breakdown of each module and the integration of various generative models demonstrate a comprehensive understanding of the challenges in automated filmmaking.
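At a schematic level, the hierarchical orchestration can be pictured as a pipeline of agent calls: a narrative stage produces a blueprint, character assets are built once and reused, and each scene is composed and then synchronized. All names and signatures below are hypothetical and only sketch the decomposition described above.

```python
from dataclasses import dataclass, field

@dataclass
class Blueprint:
    characters: dict = field(default_factory=dict)   # character profiles keyed by name
    scenes: list = field(default_factory=list)       # ordered scene descriptions

def make_movie(story_prompt, narrative_agent, character_agent, scene_agent, sync_agent):
    blueprint = narrative_agent(story_prompt)                     # collaborative script stage
    assets = {name: character_agent(profile)                      # instance-level identity assets
              for name, profile in blueprint.characters.items()}
    clips = []
    for scene in blueprint.scenes:
        video = scene_agent(scene, assets)                        # compose multi-character shots
        clips.append(sync_agent(video, scene))                    # align dialogue, expressions, music
    return clips
```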
The experimental evaluation is thorough, utilizing a diverse benchmark of 100 story prompts across multiple genres to assess the framework's performance. The use of both quantitative metrics and qualitative human evaluations provides a well-rounded perspective on the system's effectiveness. The reported improvements in consistency and aesthetic quality are substantial, indicating that the proposed methods yield significant enhancements over existing baselines. However, the paper could benefit from more detailed comparisons with a wider range of contemporary methods to further contextualize its contributions.
The paper provides a detailed description of the experimental setup, including generation settings, evaluation metrics, and baseline comparisons. However, the lack of publicly available code or demo URLs limits reproducibility. Future work should consider releasing the implementation to facilitate further research and validation by the community.
One limitation of the study is the reliance on specific generative models, which may not generalize across all contexts or genres of filmmaking. Additionally, while the framework shows improvements in character consistency and narrative coherence, the complexity of the system may introduce challenges in real-time applications or scalability. The computational cost of approximately 11.3 minutes per scene on a single GPU could also be a barrier for broader adoption.
The implications of CineAGI extend beyond academic research into practical applications in the film and entertainment industry. By automating aspects of movie creation, this framework could democratize content production, enabling creators with limited resources to produce high-quality narratives. Furthermore, the integration of AI in creative processes raises questions about authorship and the role of human creativity in storytelling.
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, and that these heads behave very differently on semantic versus acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining only 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, DAIL Tech, Northeastern University, Sichuan University, Huazhong University of Science and Technology
The main contribution of this paper is the introduction of HeadRouter, a dynamic head-weight routing mechanism for audio token pruning in large audio language models, which significantly enhances performance and efficiency in processing diverse audio tasks. This work represents a meaningful advancement in the field of audio language models, addressing critical challenges in token management and model efficiency while maintaining high performance across various audio tasks.
The proposed HeadRouter method introduces a novel dynamic head-weight routing mechanism that adapts to the varying importance of attention heads in large audio language models (LALMs). This approach is innovative in its use of entropy-based selectivity scores and Gaussian soft mixing to create task-specific head-weight profiles. The training-free nature of the method allows it to be easily integrated into existing models without additional training overhead, which is a significant advantage for practical applications.
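A hedged sketch of the head-importance idea: per-head attention entropy yields a selectivity score, focused (low-entropy) heads are up-weighted, and audio-token importance is the head-weighted attention each token receives. The paper's calibration, routing, and Gaussian soft-mixing details are not reproduced.

```python
import torch

def prune_audio_tokens(attn: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """attn: (heads, queries, audio_tokens) attention weights; returns indices of kept tokens."""
    # Selectivity: heads with peaked (low-entropy) attention are treated as more informative.
    probs = attn.mean(dim=1)                                   # (heads, audio_tokens)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (heads,)
    head_weight = torch.softmax(-entropy, dim=0)               # low entropy -> high weight
    # Token importance: attention each audio token receives, weighted by head importance.
    importance = (head_weight[:, None] * probs).sum(dim=0)     # (audio_tokens,)
    k = max(1, int(keep_ratio * importance.numel()))
    return importance.topk(k).indices.sort().values

attn = torch.rand(8, 16, 200).softmax(dim=-1)   # toy attention map
print(prune_audio_tokens(attn).shape)           # torch.Size([140])
```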
The experiments conducted on the AudioMarathon and MMAU-Pro benchmarks demonstrate the effectiveness of HeadRouter in outperforming existing token pruning methods across various audio tasks. The results indicate that the method not only maintains performance while aggressively pruning tokens but also adapts well to different audio contexts, showcasing its robustness. The comparative analysis with state-of-the-art methods further validates the proposed approach's superiority in managing token importance dynamically.
The paper provides a clear description of the methodology, including the routing mechanism and evaluation setup, which supports reproducibility. However, the lack of publicly available code or detailed implementation guidelines may hinder full reproducibility for other researchers.
One limitation is the reliance on pre-calibrated head-weight profiles, which may not generalize across all audio tasks or models. Additionally, while the method shows promise in reducing computational costs, the paper does not explore the implications of using HeadRouter in real-time applications or its impact on latency in practical deployments.
The implications of this research extend to various applications in audio processing, including speech recognition, music analysis, and multimodal systems. By improving the efficiency of LALMs, this work could facilitate more widespread adoption of advanced audio understanding technologies in real-time applications, enhancing user experiences in voice-interactive systems.
Machine generation of symbolic music and digital audio are hot topics, yet relatively few digital musical instruments integrate generative AI. Present musical AI tools are not artist-centred and do not support experimentation or integration into musical instruments and practices. This work introduces an inexpensive generative AI instrument platform based on a single-board computer that connects via MIDI to other musical devices. The platform uses artist-collected datasets with models trained on a regular computer. This paper asks what the design space of intelligent musical instruments might look like when accessible and portable AI systems are available for artistic exploration. I contribute five examples of instruments created and tested through a two-year first-person artistic research process. These show that (re)mapping can replace retraining for discovering AI interaction, that fast input interleaving is a new co-creative strategy, that small-data AI models can be a transportable design resource, and that cheap hardware can lower barriers to inclusion. This work could enable artists to explore new interaction and performance schemes with intelligent musical instruments.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a novel generative AI platform for intelligent musical instruments, emphasizing artist-centered design and small-data approaches. The comprehensive exploration of performance experiences and instrument development contributes valuable insights to the intersection of AI and music, highlighting the potential for innovative co-creative practices.
The methodology is grounded in a first-person artistic research approach, which is innovative in the context of generative AI in music. The use of small-data AI models trained on artist-collected datasets is a significant contribution, allowing for a more personalized and artist-centered exploration of generative AI in musical contexts. The paper effectively outlines the design and implementation of a generative AI platform that integrates with existing musical instruments, showcasing a practical application of AI in music performance. The iterative development of five distinct instruments provides a rich qualitative dataset for analysis.
The experiments conducted over two years of performance practice are well-documented, providing insights into the evolution of the instruments and their interactions with musicians. The author details the performance experiences and the adaptability of the instruments in various contexts, which adds depth to the evaluation. However, the paper lacks quantitative metrics for assessing the performance of the AI models, which could strengthen the evaluation of their effectiveness.
The implementation details are provided, including the use of Raspberry Pi and the open-source nature of the software, which enhances reproducibility. The availability of the project on GitHub allows others to replicate the setup and experiment with the platform. However, more detailed instructions on the configuration and training processes would further aid reproducibility.
The study is limited by its first-person perspective, which may not capture the full range of experiences from diverse musicians. Additionally, the exploration of model updates over time is not systematically addressed, which could provide further insights into the adaptability and longevity of the AI models in performance settings.
This work has the potential to democratize access to intelligent musical instruments by lowering the cost barrier and encouraging experimentation among artists. The findings could influence future designs of musical AI systems, promoting a shift towards artist-centered approaches in generative AI applications. The implications for HCI and music technology communities are significant, as the research opens new avenues for interaction and collaboration between humans and AI in creative practices.
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that forces models to learn platform-invariant semantic structural representations. In this paper, the RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set further includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross-platform generalization and noise robustness, offering an effective and generalizable modeling paradigm. The RTCFake dataset is available at https://huggingface.co/datasets/JunXueTech/RTCFake.
Primary: unknown
All Institutions: unknown
The paper presents RTCFake, a novel dataset and a phoneme-guided consistency learning strategy for detecting speech deepfakes in real-time communication, addressing a critical gap in existing research. The methodology is innovative, and the experimental results demonstrate substantial improvements, making it a valuable contribution to the field of audio and speech processing.
The paper introduces a phoneme-guided consistency learning (PCL) strategy, which is a novel approach aimed at enhancing the robustness of speech deepfake detection in real-time communication scenarios. The proposed methodology effectively addresses the challenges posed by various distortions and codec compressions encountered in RTC environments. The dataset, RTCFake, is a significant contribution, as it is specifically designed for the complexities of real-time communication, which is often overlooked in existing literature.
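A minimal sketch of a consistency objective over paired offline/online utterances: embeddings pooled over (assumed) phoneme spans are pulled together so the detector learns platform-invariant structure. The actual PCL loss and the phoneme alignment procedure are not described here in enough detail to reproduce, so everything below is illustrative.

```python
import torch
import torch.nn.functional as F

def phoneme_consistency_loss(emb_offline, emb_online, spans):
    """emb_*: (frames, dim) frame embeddings; spans: list of (start, end) phoneme frame ranges."""
    losses = []
    for start, end in spans:
        a = emb_offline[start:end].mean(dim=0)   # phoneme-level pooling, offline channel
        b = emb_online[start:end].mean(dim=0)    # same span after RTC transmission
        losses.append(1 - F.cosine_similarity(a, b, dim=0))
    return torch.stack(losses).mean()

off, on = torch.randn(100, 64), torch.randn(100, 64)   # toy paired frame embeddings
print(phoneme_consistency_loss(off, on, [(0, 20), (20, 55), (55, 100)]))
```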
The authors provide a comprehensive evaluation of their proposed method using a large-scale dataset of approximately 600 hours of speech. The evaluation set includes both unseen RTC platforms and complex noise conditions, which enhances the realism of the testing environment. The reported improvements in cross-platform generalization and noise robustness are significant, indicating that the proposed method is effective in practical applications.
While the paper mentions the availability of the RTCFake dataset on Hugging Face, it lacks detailed implementation specifics regarding the PCL strategy and the models used. This omission could hinder reproducibility, as other researchers may struggle to replicate the results without clear guidance on the experimental setup.
One limitation is that the dataset may not encompass all possible real-time communication scenarios, potentially limiting the generalizability of the findings. Additionally, the paper does not address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant, as it addresses a pressing issue in the age of deepfake technology. The ability to detect speech deepfakes in real-time communication can have far-reaching effects on security, privacy, and trust in digital communications. The proposed dataset and methodology could serve as a foundation for future research in this area.
Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting the suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method struggles to track the directional variations of non-stationary noise, such as that from a moving source. Therefore, this work proposes a Predictive Directional SFANC (PD-SFANC) method that uses a Convolutional Recurrent Neural Network (CRNN) to capture the hidden temporal dynamics of the moving noise and predict the control filter to cancel future noise. Accordingly, the proposed method can significantly improve its noise-tracking ability and dynamic noise-reduction performance. Furthermore, numerical simulations confirm the superiority of the proposed method for handling moving sources across various movement scenarios, compared to several representative ANC baselines.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Northwestern Polytechnical University
The main contribution of this paper is the introduction of a novel PD-SFANC method that leverages CRNNs for proactive noise control in dynamic environments. This work significantly advances the field of active noise control by addressing the challenges of tracking moving noise sources, offering a promising solution that could enhance the performance of ANC systems in real-world applications.
The proposed Predictive Directional SFANC (PD-SFANC) method effectively integrates a Convolutional Recurrent Neural Network (CRNN) for predicting the Direction-of-Arrival (DoA) of moving noise sources. The methodology is well-structured, utilizing a pre-trained control filter library and a dual-module architecture that separates the predictive and real-time noise control processes. This design addresses the limitations of existing methods, particularly the lag in filter adaptation for moving sources, showcasing a significant advancement in active noise control systems.
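The predictive module can be sketched as a small CRNN that maps recent reference-microphone frames to a distribution over the pre-trained control-filter library, so the filter for the next instant is chosen before the noise arrives. Layer sizes and the library size below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FilterPredictor(nn.Module):
    """CRNN sketch: conv features per frame, GRU over time, logits over the filter library."""

    def __init__(self, n_filters: int = 12, feat_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(),
                                  nn.Conv1d(16, feat_dim, 5, padding=2), nn.ReLU())
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, n_filters)      # one logit per pre-trained filter

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (batch, time, samples_per_frame) recent reference-signal frames."""
        b, t, s = frames.shape
        feats = self.conv(frames.reshape(b * t, 1, s)).mean(dim=-1).reshape(b, t, -1)
        out, _ = self.gru(feats)
        return self.head(out[:, -1])                    # predict the filter for the next frame

logits = FilterPredictor()(torch.randn(2, 10, 256))
print(logits.argmax(dim=-1))                            # predicted control-filter indices
```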
The experiments are comprehensive, utilizing numerical simulations to evaluate the performance of PD-SFANC against established baseline methods. The authors provide detailed descriptions of the simulation setup, including the dataset construction and the noise scenarios tested. The results demonstrate that PD-SFANC outperforms traditional methods in various movement scenarios, with robust noise reduction performance and accurate DoA predictions, reinforcing the effectiveness of the proposed approach.
The paper mentions that the code will be available on GitHub, which is a positive aspect for reproducibility. However, specific implementation details, such as hyperparameters and training settings, could be more explicitly stated to facilitate easier replication of the results by other researchers.
One limitation is that the proposed method is designed for single-source scenarios, which may restrict its applicability in environments with multiple overlapping noise sources. Additionally, while the CRNN shows strong performance, its reliance on a pre-trained filter library may limit adaptability to entirely new noise types not represented in the training data.
The implications of this research extend to various fields where noise control is critical, such as automotive, aviation, and consumer electronics. The ability to effectively manage noise from moving sources can enhance user experience in products like headphones, smart devices, and automotive noise cancellation systems, potentially leading to broader adoption of advanced ANC technologies.
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, thereby preserving a higher degree of naturalness that makes imitation-based speech forgery significantly more challenging to detect using conventional acoustic or cepstral features. To overcome this challenge, this study proposes an auditory perception-based Spectro-Temporal Modulation (STM) representation framework for human-imitated speech detection. The STM representations are derived from two cochlear filterbank models: the Gammatone Filterbank (GTFB), which simulates frequency selectivity and can be regarded as a first approximation of cochlear filtering, and the Gammachirp Filterbank (GCFB), which further models both frequency selectivity and level-dependent asymmetry. These STM representations jointly capture temporal and spectral fluctuations in speech signals, corresponding to changes over time in the spectrogram and variations along the frequency axis related to human auditory perception. We also introduce a Segmental-STM representation to analyze short-term modulation patterns across overlapping time windows, enabling high-resolution modeling of temporal speech variations. Experimental results show that STM representations are effective for human-imitated speech detection, achieving accuracy levels close to those of human listeners. In addition, Segmental-STM representations are more effective, surpassing human perceptual performance. The findings demonstrate that perceptually inspired spectro-temporal modeling is promising for detecting imitation-based speech attacks and improving voice authentication robustness.
Primary: Japan Advanced Institute of Science and Technology
All Institutions: Japan Advanced Institute of Science and Technology
The paper presents a comprehensive framework for detecting human-imitated speech through innovative auditory-inspired representations, addressing a critical gap in the field. The methodology is well-founded in auditory processing principles, and the experimental results demonstrate significant advancements in detection accuracy, highlighting the potential for real-world applications in voice authentication and security.
The paper introduces a novel Spectro-Temporal Modulation (STM) representation framework based on auditory perception, utilizing Gammatone and Gammachirp filterbanks to capture temporal and spectral fluctuations in human-imitated speech. The methodology is well-grounded in auditory processing principles, and the introduction of Segmental-STM representation enhances the modeling of short-term modulation patterns, which is a significant advancement over conventional acoustic features. The approach is innovative, addressing a critical gap in the detection of human-imitated speech, which has been underexplored in existing literature.
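As a simplified illustration of an STM feature, the sketch below computes a time-frequency envelope (a plain magnitude spectrogram stands in for the gammatone/gammachirp cochleagram used in the paper) and takes a 2-D FFT to obtain joint temporal- and spectral-modulation energy. This is a generic modulation-spectrum computation, not the authors' exact pipeline.

```python
import numpy as np

def stm_representation(signal: np.ndarray, win: int = 512, hop: int = 128) -> np.ndarray:
    # Time-frequency envelope (stand-in for a cochlear filterbank output).
    frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop] * np.hanning(win)
    spec = np.abs(np.fft.rfft(frames, axis=1)).T     # (freq_bins, time_frames)
    env = np.log1p(spec)
    # Joint modulation spectrum: FFT across time (temporal modulations)
    # and across the frequency axis (spectral modulations).
    mod = np.abs(np.fft.fft2(env - env.mean()))
    return np.fft.fftshift(mod)

x = np.random.randn(16000)                           # 1 s of toy audio at 16 kHz
print(stm_representation(x).shape)
```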
The experimental setup is robust, utilizing a dataset specifically designed for human-imitated speech detection. The results indicate that the proposed STM representations outperform traditional acoustic features, achieving accuracy levels comparable to human listeners. The inclusion of multiple classifiers (SVM, KNN, Extra Trees) strengthens the evaluation, and the performance metrics are clearly presented. However, the dataset size could be a limitation, as only 100 samples were used for testing, which may affect the generalizability of the findings.
The paper provides a detailed description of the methodology, including the computation of STM representations and the machine learning models used. However, the lack of a publicly available dataset or code repository limits reproducibility. Future work should consider sharing the dataset and implementation details to facilitate independent validation of results.
The primary limitation is the small dataset size, which may restrict the robustness of the findings and their applicability to broader contexts. Additionally, while the results are promising, the study does not address potential variations in performance across different languages or speaker characteristics, which could affect the generalizability of the approach.
The proposed framework has significant implications for voice authentication and security systems, particularly in contexts where human-imitated speech poses a threat. By improving detection capabilities, this work could enhance the security of voice-based systems, making them more resilient against imitation attacks. The findings also contribute to the understanding of auditory perception in speech processing, potentially influencing future research in related fields.
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Primary: Wuhan University
All Institutions: Wuhan University, Tencent Ethereal Audio Lab, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of DM-ASR, a diarization-aware multi-speaker ASR framework that effectively combines speaker attribution and temporal grounding through a structured dialogue generation approach. This innovative methodology not only improves transcription quality but also demonstrates the potential of integrating diarization cues with large language models, marking a significant advancement in the field of automatic speech recognition.
The proposed DM-ASR framework innovatively reformulates the multi-speaker ASR task as a multi-turn dialogue generation process, effectively integrating speaker diarization cues into the transcription process. This approach decouples speaker identity and temporal information from linguistic content, allowing for a structured generation that enhances both transcription accuracy and robustness against imperfect diarization cues. The introduction of special tokens for speaker and timestamp information, alongside the optional word-level timestamp prediction, represents a significant methodological advancement in the field.
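The decomposition into speaker- and time-conditioned queries can be illustrated with a toy builder that turns diarization output into one structured prompt per segment. The special tokens and wording are assumptions, not the paper's actual prompt format.

```python
def build_queries(diarization):
    """diarization: list of (speaker_id, start_sec, end_sec) tuples from an external system."""
    queries = []
    for spk, start, end in sorted(diarization, key=lambda seg: seg[1]):
        queries.append(
            f"<speaker:{spk}> <time:{start:.2f}-{end:.2f}> "
            "Transcribe what this speaker says in this segment."
        )
    return queries

for q in build_queries([("S1", 0.0, 3.2), ("S2", 2.8, 6.5), ("S1", 6.5, 9.0)]):
    print(q)
```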
The experiments conducted on both Mandarin and English datasets demonstrate the effectiveness of DM-ASR, achieving competitive performance with smaller models and limited training data compared to larger, more data-intensive systems. The results indicate that the framework not only outperforms traditional cascaded systems but also rivals state-of-the-art end-to-end models, showcasing the practical applicability and generalizability of the proposed method across different languages and conversational contexts.
The paper provides detailed implementation information, including the architecture of the model, training procedures, and datasets used, which enhances reproducibility. However, the lack of publicly available code or demo URLs limits the ability for others to directly replicate the findings without additional effort.
One notable limitation is the reliance on external diarization systems, which can introduce errors that affect overall performance. Additionally, while the model shows robustness against imperfect cues, it does not consistently outperform strong diarization front-ends under all conditions, indicating a potential area for improvement. The paper also does not explore the scalability of the method to larger datasets or more complex conversational scenarios.
The DM-ASR framework has significant implications for real-world applications in multi-speaker environments such as meetings, interviews, and call centers. By improving the accuracy of speaker attribution and temporal grounding in ASR systems, it could enhance accessibility for users requiring accurate transcriptions, such as those with hearing impairments. Furthermore, the integration of LLMs with diarization cues could pave the way for more advanced conversational AI systems capable of understanding and generating human-like dialogue.
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
Primary: Klangio GmbH
All Institutions: Klangio GmbH, Institute of Industrial Information Technology, Karlsruhe Institute of Technology
This paper presents a novel transformer-based approach for beat-based rhythm quantization of MIDI performances, significantly advancing the field of Automatic Music Transcription. The integration of beat annotations into the quantization process enhances the model's performance and flexibility, marking a meaningful contribution to music information retrieval.
The methodology is robust, leveraging a transformer architecture tailored for rhythm quantization by incorporating beat annotations. The preprocessing steps for aligning performance and score data are well-defined, and the tokenization scheme is innovative, allowing for efficient encoding of musical data. The model's adaptability to different time signatures and its ability to generalize across unseen time signatures are significant contributions. However, the reliance on a priori beat information may limit its applicability in scenarios where such data is not available.
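A small sketch of the beat-based pre-quantization idea: performance onset times in seconds are re-expressed in beats by interpolating against the a priori beat annotations, giving a tempo-invariant time axis. The final rounding to a fixed subdivision is only for illustration, since the paper learns the score-level rhythm with a transformer rather than snapping to a grid.

```python
import numpy as np

def onsets_to_beats(onsets_sec, beat_times_sec, subdivision=4):
    """Map onset times (s) to fractional beat positions given annotated beat times (s)."""
    beats = np.arange(len(beat_times_sec), dtype=float)      # beat index grid 0, 1, 2, ...
    pos = np.interp(onsets_sec, beat_times_sec, beats)       # fractional beat positions
    return np.round(pos * subdivision) / subdivision         # snap to a 16th-note grid

beats = [0.0, 0.52, 1.01, 1.55, 2.04]                        # slightly uneven performed beats
print(onsets_to_beats([0.0, 0.27, 0.51, 1.28, 1.80], beats))
# [0.  0.5 1.  2.5 3.5]
```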
The experiments are comprehensive, utilizing a suitable dataset (ASAP) that includes diverse performance MIDI files. The evaluation metrics are well-chosen, focusing on onset F1-score and note value accuracy, which are critical for assessing quantization performance. The results demonstrate strong performance compared to state-of-the-art models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more extensive comparisons with a broader range of existing methods.
The paper provides sufficient details on the model architecture, training process, and evaluation metrics, which would allow other researchers to replicate the study. However, the absence of a publicly available code repository limits reproducibility.
The main limitations include the dependency on beat annotations, which may not always be available, and the model's performance on more complex time signatures that were not part of the training set. Additionally, the focus on piano and guitar data may restrict the model's generalizability to other instruments.
This work has significant implications for music information retrieval and automatic music transcription, offering a new approach to rhythm quantization that could enhance the usability of MIDI data in various applications, including music education, performance analysis, and music generation. The model's ability to generalize across different time signatures and instruments could lead to broader applications in music technology.
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of the ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system's ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.
Primary: Nanjing University
All Institutions: Nanjing University, Northwestern Polytechnical University, AISHELL
This paper presents a comprehensive study on full-duplex interaction in spoken dialogue systems, introducing a novel dataset and evaluation framework that significantly advance the field. The methodology is well-structured, and the results demonstrate the potential for developing more human-like dialogue systems, addressing key challenges in real-time conversational dynamics.
The paper introduces a dual-channel dataset that captures realistic conversational dynamics, including interruptions and overlapping speech, which is a significant advancement over existing datasets that primarily focus on single-channel recordings. The methodology for dataset construction combines LLM-generated scripts with human recordings, ensuring both authenticity and control over interaction behavior. The evaluation framework, HumDial-FDBench, is well-structured, providing clear metrics for assessing system performance in real-time dialogue scenarios. This comprehensive approach allows for a nuanced understanding of full-duplex interaction, making it a valuable resource for future research.
The experimental results are robust, with a clear comparison of various models' performance on the released benchmark. The paper provides detailed metrics for interruption handling, rejection behavior, and response latency, which are critical for evaluating the effectiveness of dialogue systems in real-world scenarios. The inclusion of a public leaderboard enhances the transparency and reproducibility of the results, encouraging further development in this area. However, the paper could benefit from more extensive discussion on the specific experimental setups and conditions under which the models were evaluated.
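As an illustration of the kind of latency metric being evaluated, the sketch below measures how quickly a system yields after a user barge-in, given dual-channel speech segments; the segment representation and the definition of "yield" are assumptions and are not taken from HumDial-FDBench.

```python
# Illustrative sketch (not the HumDial-FDBench definition): measuring how quickly
# a full-duplex system yields after a user interruption, given the system
# channel's speech segments as (start, end) times in seconds.

def yield_latency(system_segments, interruption_start):
    """Return the delay between a user interruption and the moment the system
    stops speaking, or None if the system was not speaking at that time."""
    for start, end in system_segments:
        if start <= interruption_start < end:
            # The system was mid-utterance when the user barged in; latency is
            # the remaining time until the system channel goes silent.
            return end - interruption_start
    return None

# Example: the system speaks from 2.0 s to 6.5 s; the user interrupts at 5.8 s.
system_segments = [(0.0, 1.2), (2.0, 6.5)]
print(yield_latency(system_segments, 5.8))  # -> 0.7 (seconds)
```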
The paper emphasizes the release of a publicly available dataset and benchmark, which facilitates reproducibility. The authors provide a clear methodology for data collection and evaluation metrics, allowing other researchers to replicate their experiments. However, the lack of detailed implementation specifics for the models evaluated may hinder full reproducibility for those attempting to build upon this work.
One limitation is the potential bias in the dataset construction, as it relies on scripted dialogues performed by professional actors, which may not fully capture the variability of spontaneous human interactions. Additionally, the paper acknowledges challenges related to background noise and speaker overlap, which could affect model performance in real-world applications. The evaluation metrics primarily focus on behavioral correctness and latency, potentially overlooking other important aspects of dialogue quality.
The resources provided by this research have significant implications for the development of more natural and responsive spoken dialogue systems. By addressing the limitations of traditional turn-taking paradigms, this work paves the way for advancements in human-computer interaction, with applications in customer service, virtual assistants, and conversational agents. The emphasis on real-time interaction and the ability to handle interruptions could lead to more engaging and effective communication tools.
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowledge, MAGIC-TTS is the first TTS model with explicit local timing control over token-level content duration and pause. MAGIC-TTS is enabled by explicit token-level duration conditioning, carefully prepared high-confidence duration supervision, and training mechanisms that correct zero-value bias and make the model robust to missing local controls. On our timing-control benchmark, MAGIC-TTS substantially improves token-level duration and pause following over spontaneous synthesis. Even when no timing control is provided, MAGIC-TTS maintains natural high-quality synthesis. We further evaluate practical local editing with a scenario-based benchmark covering navigation guidance, guided reading, and accessibility-oriented code reading. In this setting, MAGIC-TTS realizes a reproducible uniform-timing baseline and then moves the edited regions toward the requested local targets with low mean bias. These results show that explicit fine-grained controllability can be implemented effectively in a high-quality TTS system and can support realistic local timing-editing applications.
Primary: South China University of Technology
All Institutions: South China University of Technology
MAGIC-TTS introduces the first TTS model with explicit local timing control over token-level content duration and pause. This comprehensive analysis highlights the model's innovative approach to TTS, its rigorous methodology, and its potential to significantly impact the field of speech synthesis by improving the quality and controllability of generated speech.
The methodology presented in MAGIC-TTS is robust, leveraging a flow-based TTS backbone to achieve explicit local timing control over token-level content duration and pause. The authors introduce a novel training mechanism that incorporates high-confidence duration supervision and zero-value correction, which effectively addresses the challenges of local timing manipulation in TTS systems. The separation of timing control from the acoustic generation process is a significant improvement, allowing for precise control without compromising synthesis quality. The detailed explanation of the training data pipeline and the careful construction of timing supervision demonstrate a thorough understanding of the complexities involved in TTS systems.
The experiments are well-designed, utilizing a comprehensive timing-control benchmark to validate the effectiveness of MAGIC-TTS. The results show substantial improvements in token-level duration and pause accuracy when explicit controls are provided, with clear metrics such as mean absolute error and correlation coefficients. The ablation studies further strengthen the claims by isolating the contributions of key components, confirming the importance of zero-value correction and cross-validated timing supervision. The practical local editing scenarios also illustrate the model's versatility and real-world applicability.
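The objective timing metrics mentioned here, mean absolute error and correlation between requested and realized token durations, are straightforward to compute; the sketch below shows a minimal version, with variable names and units (seconds) assumed rather than drawn from the paper.

```python
# Minimal sketch of the timing metrics described above: mean absolute error and
# Pearson correlation between requested and realized per-token durations.
import math

def duration_metrics(requested, realized):
    """requested, realized: equal-length lists of per-token durations in seconds."""
    n = len(requested)
    mae = sum(abs(r - q) for q, r in zip(requested, realized)) / n
    mean_q = sum(requested) / n
    mean_r = sum(realized) / n
    cov = sum((q - mean_q) * (r - mean_r) for q, r in zip(requested, realized))
    var_q = sum((q - mean_q) ** 2 for q in requested)
    var_r = sum((r - mean_r) ** 2 for r in realized)
    pearson = cov / math.sqrt(var_q * var_r) if var_q and var_r else 0.0
    return mae, pearson

# Example: requested vs. synthesized token durations.
print(duration_metrics([0.12, 0.30, 0.08, 0.50], [0.11, 0.33, 0.10, 0.47]))
```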
The paper provides sufficient details regarding the experimental setup, including model architecture, training configurations, and evaluation protocols, which supports reproducibility. However, the absence of a publicly available demo or project URL limits the practical reproducibility of the results, as external researchers would need to replicate the entire setup from scratch.
One limitation is the reliance on high-confidence supervision, which may not be easily attainable in all datasets or languages, potentially affecting the model's generalizability. Additionally, while the paper demonstrates improvements in timing control, it does not extensively explore the impact of these improvements on user experience or subjective quality assessments in real-world applications.
The advancements in fine-grained controllability in TTS systems have significant implications for applications such as navigation guidance, accessibility tools, and interactive voice assistants. By enabling precise local timing manipulation, MAGIC-TTS can enhance the expressiveness and naturalness of synthesized speech, making it more adaptable to various contexts and user needs.
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer input and the instrument's mechanically imposed, time-varying load. Reflective optical sensors mounted beneath the distal end of each lever provide continuous displacement, timing, and articulation data without interfering with the action. Unlike existing optical systems designed for modern pianos, PHOTON accommodates the diverse geometries, limited clearances, and non-standard layouts of harpsichords, clavichords, and early fortepianos. Its modular, low-profile architecture enables high-resolution, low-latency sensing across multiple manuals and variable key counts. Beyond performance capture, PHOTON provides real-time MIDI output and supports empirical study of expressive gesture, human-instrument interaction, and the construction of instrument-specific MIDI corpora using real historical mechanisms. The complete system is released as open-source hardware and software, from schematics and PCB layouts developed in KiCad to firmware written in CircuitPython, lowering the barrier to adoption, replication, and extension.
Primary: Institute for Logic, Language, and Computation
All Institutions: Institute for Logic, Language, and Computation, University of Amsterdam
The main contribution of this paper is the introduction of the PHOTON system, a non-invasive optical tracking technology for historical keyboard instruments that facilitates detailed analysis of key-lever motion and expressive gesture. This innovative approach, combined with its open-source nature, positions PHOTON as a valuable tool for researchers and performers alike, potentially transforming the study and practice of historical keyboard music.
The methodology presented in this paper is innovative and well-structured, focusing on a non-invasive optical sensing system tailored for historical keyboard instruments. The use of reflective optical sensors to measure key-lever motion is a significant advancement over existing systems, which are primarily designed for modern pianos. The modular and low-profile design allows for high-resolution data capture while accommodating the unique geometries of historical instruments. The authors provide a thorough explanation of the hardware design, including sensor selection, calibration, and integration, which demonstrates a strong understanding of the mechanical constraints involved. The open-source nature of the project enhances its accessibility and encourages further research and development.
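To make the sensing-to-MIDI path concrete, the sketch below converts a sampled key-lever displacement trace into note-on/note-off events with a velocity derived from descent speed. The thresholds, sample rate, and velocity mapping are placeholder assumptions, not the PHOTON firmware's actual logic.

```python
# Illustrative sketch only: turning a sampled, normalized key-lever displacement
# signal into note-on/note-off events, with velocity estimated from how fast the
# key descends. All constants below are assumptions for illustration.

def displacement_to_events(samples, fs=1000, on_thresh=0.6, off_thresh=0.3):
    """samples: normalized displacement in [0, 1], one value per 1/fs seconds."""
    events = []
    pressed = False
    for i, x in enumerate(samples):
        if not pressed and x >= on_thresh:
            # Estimate descent speed over the last few samples to set velocity.
            lookback = max(i - 5, 0)
            speed = (x - samples[lookback]) * fs / max(i - lookback, 1)
            velocity = max(1, min(127, int(speed * 40)))
            events.append(("note_on", i / fs, velocity))
            pressed = True
        elif pressed and x <= off_thresh:
            events.append(("note_off", i / fs, 0))
            pressed = False
    return events

# Example: a quick press-and-release shaped trace sampled at 100 Hz.
trace = [0.0, 0.1, 0.3, 0.7, 0.9, 0.9, 0.5, 0.2, 0.0]
print(displacement_to_events(trace, fs=100))
```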
While the paper does not present extensive experimental results, it includes a case study that illustrates the effectiveness of the PHOTON system in capturing key-action behavior on a harpsichord. The authors provide motion traces that reveal fine-grained aspects of touch and articulation, which are crucial for understanding performance nuances. However, more comprehensive experiments comparing PHOTON with existing systems or evaluating its performance across various historical instruments would strengthen the paper's contributions.
The authors emphasize reproducibility by providing detailed schematics, PCB layouts, and firmware source code. The use of widely available components and open-source tools further supports the project's replicability. The inclusion of a custom KiCad plugin for sensor placement is particularly noteworthy, as it simplifies the adaptation of the system to different keyboard layouts.
One limitation of the study is the lack of extensive empirical validation across a broader range of historical keyboard instruments. While the case study is informative, additional data from various setups would provide a more robust evaluation of the system's capabilities. Furthermore, ethical considerations regarding unobtrusive sensing are briefly mentioned but could benefit from a more in-depth discussion.
The PHOTON system has the potential to significantly impact the fields of musicology, performance practice, and instrument design. By enabling detailed empirical studies of expressive gesture and human-instrument interaction, it opens new avenues for research that have been historically underrepresented. The integration of real-time MIDI output and the ability to create instrument-specific MIDI corpora can enhance both educational and performance contexts, making historical keyboard instruments more accessible to contemporary musicians.
Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
Primary: Indian Institute of Technology, Madras
All Institutions: Indian Institute of Technology, Madras, AI4Bharat, Josh Talks
The paper introduces a novel multidimensional pairwise evaluation framework for TTS systems in Indian languages, significantly advancing the methodology for evaluating speech synthesis quality. The comprehensive analysis of TTS systems across multiple perceptual dimensions and the large-scale data collection provide valuable insights that can inform future developments in the field.
The paper presents a well-structured controlled multidimensional pairwise evaluation framework tailored for multilingual TTS systems. The methodology includes a comprehensive benchmark construction with a diverse set of sentences across 10 Indic languages, addressing real-world linguistic phenomena such as code-mixing. The use of a multi-stage rater recruitment and training protocol enhances the reliability of human evaluations. The incorporation of perceptual dimensions and the Bradley-Terry model for ranking adds depth to the evaluation process, making it a significant advancement over traditional methods.
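For readers unfamiliar with the ranking step, the sketch below fits Bradley-Terry strengths from pairwise win counts using standard minorization-maximization updates; the system names and counts are invented for illustration, and the paper's own fitting pipeline may differ.

```python
# Minimal sketch of Bradley-Terry strength estimation from pairwise win counts,
# fit by simple iterative (Zermelo/MM) updates. Counts below are made up.

def bradley_terry(wins, n_iter=200):
    """wins[(a, b)] = number of comparisons in which system a beat system b."""
    systems = sorted({s for pair in wins for s in pair})
    strength = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        new = {}
        for s in systems:
            total_wins = sum(w for (a, _), w in wins.items() if a == s)
            denom = 0.0
            for (a, b), w in wins.items():
                if s in (a, b):
                    other = b if a == s else a
                    denom += w / (strength[s] + strength[other])
            new[s] = total_wins / denom if denom else strength[s]
        # Normalize so the mean strength stays at 1 (the scale is arbitrary).
        scale = len(systems) / sum(new.values())
        strength = {s: v * scale for s, v in new.items()}
    return strength

# Example: three hypothetical TTS systems with pairwise win counts.
wins = {("A", "B"): 60, ("B", "A"): 40,
        ("A", "C"): 70, ("C", "A"): 30,
        ("B", "C"): 55, ("C", "B"): 45}
print(sorted(bradley_terry(wins).items(), key=lambda kv: -kv[1]))
```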
The experiments are extensive, involving over 120K pairwise comparisons from 1900+ native raters. The evaluation of 7 state-of-the-art TTS systems across multiple languages and perceptual dimensions provides a robust analysis of system performance. The results, including the construction of a multilingual leaderboard and insights into perceptual drivers of preference, are well-documented and demonstrate the effectiveness of the proposed evaluation framework.
While the paper outlines the methodology and evaluation process in detail, the lack of a publicly available dataset or code repository limits reproducibility. The authors mention that the benchmark and preference data will be released, which is a positive step towards enabling future research but currently hinders immediate reproducibility.
The study primarily focuses on TTS systems within the context of Indian languages, which may limit the generalizability of the findings to other languages or regions. Additionally, while the evaluation framework is robust, the reliance on subjective human judgments introduces variability that may affect the consistency of results across different contexts.
The findings have significant implications for the development of TTS systems in multilingual contexts, particularly in regions with high linguistic diversity like India. The proposed evaluation framework can serve as a model for future research in TTS and other areas of speech synthesis, potentially leading to improved accessibility and user experience in voice-driven applications.
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic gradient of the portamento slide, measured in Hz/second, and demonstrates its measurement using a protocol combining Sonic Visualizer's melodic spectrogram layer, GIMP pixel analysis, and metric calibration against the spectrogram's known frequency axis. The gradient captures what duration alone cannot: the steepness of the pitch trajectory, which encodes the expressive character of the slide independently of its length. The method is applied to opening measures chosen specifically because their monophonic texture permits reliable spectrographic pitch tracking, and it yields gradient values ranging from approximately 600 Hz/s in late-period recordings to over 4,000 Hz/s in early twentieth-century performances. The paper further documents a gain-recovery protocol that extends the analysable corpus to analogue recordings from the 1930s, where portamento traces are faint in digital transfer. Applying the method to a corpus of 22 recordings spanning 1930–2012, the paper tests the hypothesis that gradient steepness correlates negatively with tempo: slower performances produce steeper, longer slides, while faster performances produce shallower slides or none at all. The results support this hypothesis, suggesting that the widely documented decline of portamento across the twentieth century is not a binary transition from presence to absence but a continuous decline in gradient steepness.
Primary: unknown
All Institutions: unknown
This paper introduces a new quantitative descriptor for portamento in string performance, significantly enhancing the analysis of expressive techniques in historical recordings. The innovative methodology and empirical findings provide valuable insights into the evolution of musical expression, making a meaningful contribution to the fields of musicology and audio analysis.
The paper introduces a novel methodology for measuring portamento in string performance through a spectrographic gradient, which is a significant advancement over existing binary measures of portamento presence and duration. The combination of Sonic Visualizer for spectrogram analysis and GIMP for pixel analysis is innovative, allowing for a more nuanced understanding of musical expressiveness. The calibration of the gradient measurement to physical units (Hz/second) adds rigor and comparability to the findings.
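The calibration step described here reduces to converting pixel extents on the spectrogram image into physical units; the sketch below illustrates that conversion, with placeholder calibration constants rather than the paper's values.

```python
# Minimal sketch of pixel-to-physical-units calibration: converting a slide's
# pixel extent on a spectrogram image into a gradient in Hz/second.
# The calibration constants below are placeholders, not the paper's values.

def portamento_gradient(dx_px, dy_px, hz_per_px, sec_per_px):
    """dx_px, dy_px: horizontal/vertical pixel extent of the slide trace.
    hz_per_px, sec_per_px: calibration constants read off the spectrogram axes."""
    delta_f = dy_px * hz_per_px   # pitch change of the slide, in Hz
    delta_t = dx_px * sec_per_px  # duration of the slide, in seconds
    return delta_f / delta_t      # gradient in Hz/s

# Example: a slide spanning 80 px vertically and 25 px horizontally on a
# spectrogram where 1 px = 2.5 Hz and 1 px = 0.005 s.
print(portamento_gradient(dx_px=25, dy_px=80, hz_per_px=2.5, sec_per_px=0.005))
# -> 1600.0 Hz/s, within the range of values reported in the paper.
```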
The experiments are well-structured, utilizing a corpus of 22 recordings spanning over eight decades. The analysis of gradient values and their correlation with tempo provides empirical support for the paper's hypotheses. The use of historical recordings adds depth to the findings, showing a continuous decline in portamento expressiveness rather than a simple absence.
The methodology is detailed, with clear steps for measurement and calibration, which should allow other researchers to reproduce the analysis. However, the reliance on human judgment when placing reference points for the gradient measurement introduces variability that could affect consistency across analysts.
The study is limited to specific passages of two sonatas, which may not generalize across the entire cello repertoire. Additionally, the subjective nature of reference point placement could lead to inconsistencies in gradient measurement. The calibration constants are also specific to the settings used, which may limit comparisons with other studies.
This research has the potential to influence both musicology and performance practice by providing a quantitative framework for analyzing expressive techniques in string performance. The findings could inform teaching practices and performance interpretations, as well as contribute to the broader understanding of stylistic evolution in music.