Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
Primary: Tianjin University
All Institutions: Tianjin University, Kuaishou Technology, Institute of Automation, Chinese Academy of Sciences
UniSonate presents a unified framework for audio generation that synthesizes speech, music, and sound effects through a novel natural language interface. The technical contributions, including dynamic token injection and a multi-stage curriculum learning strategy, significantly advance the field of generative audio modeling, offering a comprehensive solution to the challenges of multimodal audio synthesis.
The methodology proposed in UniSonate is innovative, introducing a unified flow-matching framework that integrates speech, music, and sound effect generation through a natural language interface. The dynamic token injection mechanism is particularly noteworthy as it allows unstructured sound effects to be processed in a structured manner, enabling precise control over audio generation. This is complemented by a multi-stage curriculum learning strategy that effectively mitigates optimization conflicts, showcasing a thoughtful approach to training across diverse audio modalities.
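As background for the flow-matching objective named above, a minimal conditional flow-matching training step can be sketched in a few lines. The toy data, dimensions, and oracle model here are illustrative, not UniSonate's actual implementation:

```python
import numpy as np

def flow_matching_loss(x0, x1, t, velocity_model):
    """One conditional flow-matching step (rectified-flow form): interpolate
    noise x0 toward data x1 and regress the constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight probability path
    target = x1 - x0                # ground-truth velocity along that path
    pred = velocity_model(x_t, t)
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)         # "noise" sample
x1 = rng.standard_normal(8)         # "data" sample (e.g. an audio latent)
t = 0.3

# An oracle that already outputs the true velocity incurs zero loss.
oracle = lambda x, t: x1 - x0
print(flow_matching_loss(x0, x1, t, oracle))  # → 0.0
```

In a real system the oracle is replaced by a network (here, the MM-DiT) conditioned on instructions, phonemes, and injected tokens, with t sampled per training batch.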
The experimental evaluation is robust, with extensive comparisons against state-of-the-art models in TTS, TTM, and TTA. The paper presents clear metrics for performance evaluation, including WER, SongEval scores, and subjective evaluations like MOS. The results indicate that UniSonate achieves state-of-the-art performance in TTS and TTM while maintaining competitive fidelity in TTA, demonstrating the effectiveness of the proposed methods.
The paper provides a comprehensive description of the model architecture, training procedures, and datasets used, which supports reproducibility. However, the lack of a public code repository may hinder independent verification of results. The authors do mention the use of specific hardware configurations and hyperparameters, which aids in understanding the implementation details.
The paper acknowledges limitations, particularly in the sound effect generation where performance lags behind specialized models. Additionally, challenges in generating long-form audio content and the inherent ambiguity in natural language instructions are highlighted. These limitations suggest areas for future research and improvement.
The potential applications of UniSonate are significant, as it paves the way for general-purpose audio generation systems that can synthesize complex auditory scenes. However, ethical considerations regarding the misuse of generated audio, biases in training data, and copyright issues in music generation are critical and warrant careful attention.
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff notation and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline--grounded in canonical pitch projection--to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, China Conservatory of Music, Nanyang Technological University
The paper introduces ONOTE, a comprehensive benchmark for evaluating Omnimodal Notation Processing, which addresses critical gaps in the assessment of music intelligence systems. The methodology and results presented are significant contributions to the field, paving the way for more effective and interpretable models in music AI.
The proposed ONOTE benchmark introduces a structured and deterministic evaluation framework for Omnimodal Notation Processing (ONP), addressing the limitations of existing models that often rely on subjective evaluations. The methodology effectively integrates multiple notation systems and tasks, ensuring a comprehensive assessment of model capabilities across auditory, visual, and symbolic domains. The use of canonical pitch projection and sequence alignment to eliminate biases is particularly innovative, allowing for a more rigorous comparison of model performance.
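To make "canonical pitch projection" concrete, one plausible reading is projecting heterogeneous note spellings onto a single canonical pitch space (MIDI numbers) before sequence alignment; the mapping and metric below are a generic sketch of that idea, not ONOTE's actual pipeline:

```python
# Generic sketch: project note names onto canonical MIDI numbers, then
# compare two projected sequences with Levenshtein edit distance.
NOTE_TO_SEMITONE = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def to_midi(note):
    """'C#4' -> 61; scientific pitch notation with '#'/'b' accidentals."""
    name, rest = note[0], note[1:]
    accidental = rest.count('#') - rest.count('b')
    octave = int(rest.lstrip('#b'))
    return 12 * (octave + 1) + NOTE_TO_SEMITONE[name] + accidental

def edit_distance(a, b):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

ref = [to_midi(n) for n in ['C4', 'E4', 'G4']]    # C major triad
hyp = [to_midi(n) for n in ['C4', 'Eb4', 'G4']]   # minor third instead
print(edit_distance(ref, hyp))  # → 1
```

Because enharmonic spellings (e.g. D#4 and Eb4) project to the same MIDI number, a comparison in this space is deterministic and notation-agnostic, which is the bias-elimination property the abstract claims.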
The experiments conducted on leading omnimodal models reveal significant insights into their performance across various tasks, including Visual Score Understanding (VSU), Cross-Format Notation Conversion (CNC), Audio-to-Symbolic Transcription (AST), and Symbolic Music Generation (SMG). The results highlight a clear disconnect between perceptual accuracy and music-theoretic comprehension, underscoring the benchmark's effectiveness in diagnosing reasoning vulnerabilities. The dataset construction and evaluation metrics are well-defined, providing a robust foundation for future research.
The paper provides detailed implementation details and a clear methodology for constructing the ONOTE benchmark, which enhances reproducibility. The availability of the dataset and code on GitHub further supports the reproducibility of the results, allowing other researchers to validate and build upon the work.
While the benchmark addresses several critical issues in music notation processing, it may still be limited by the inherent biases present in the datasets used for training and evaluation. Additionally, the focus on specific notation systems may not fully encompass the diversity of global musical representations, potentially limiting the generalizability of the findings.
The ONOTE benchmark has the potential to significantly influence the field of music intelligence by providing a standardized evaluation framework that encourages the development of more robust and interpretable omnimodal systems. Its implications extend beyond academic research, potentially impacting music education, automated composition, and music analysis tools.
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these limitations to the lack of data, benchmarks, and modeling approaches tailored for long-form temporal awareness. To bridge this gap, we first construct LAT-Chronicle, a 1.2k-hour long-form audio dataset with temporal annotations across real-world scenarios. We further develop LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes while covering three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption. Leveraging these resources, we propose LAT-Audio, formulating temporal awareness as a progressive global-to-local reasoning paradigm. A global timeline is first constructed as an aligned temporal-semantic context, and the Think-With-Audio Chain-of-Thought (TWA-CoT) is then introduced to perform iterative reasoning by incorporating local audio information via tool use. Experiments show that LAT-Audio surpasses existing models on long-form audio temporal awareness tasks and improves robustness to input duration. We release the dataset, benchmark, and model to facilitate future research at https://github.com/alanshaoTT/LAT-Audio-Repo.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Independent Researcher
The main contribution of this paper is the introduction of a novel framework and dataset for improving temporal awareness in long-form audio understanding, which significantly advances the state of the art in audio language models. The comprehensive methodology, robust experimental validation, and potential applications underscore its significance in the field of machine learning and audio processing.
The paper presents a comprehensive methodology that addresses the limitations of existing Large Audio Language Models (LALMs) in handling long-form audio. The authors construct a new dataset (LAT-Chronicle) and benchmark (LAT-Bench) specifically designed for Long-form Audio Temporal Awareness (LATA) tasks, which include Dense Audio Captioning, Temporal Audio Grounding, and Targeted Audio Captioning. The proposed LAT-Audio framework introduces a novel global-to-local reasoning paradigm and the Think-With-Audio Chain-of-Thought (TWA-CoT) approach, which iteratively refines audio understanding by leveraging local audio segments based on a constructed global timeline. This innovative approach is well-justified and effectively addresses the challenges posed by long-form audio inputs.
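Temporal Audio Grounding is conventionally scored with temporal intersection-over-union between predicted and reference segments; a minimal version of that standard metric (not necessarily LAT-Bench's exact protocol) is:

```python
def temporal_iou(pred, gold):
    """IoU between two (start, end) segments, in seconds."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

# Prediction overlaps the reference by 5 s out of a 15 s union.
print(round(temporal_iou((10.0, 20.0), (15.0, 25.0)), 3))  # → 0.333
```

Grounding accuracy is then typically reported as the fraction of queries whose prediction exceeds an IoU threshold such as 0.5 or 0.7.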
The experimental evaluation is robust, demonstrating the effectiveness of LAT-Audio against existing models across multiple tasks. The authors provide thorough comparisons with baseline models and conduct ablation studies to validate the importance of key components such as the global timeline and TWA-CoT. The results show significant improvements in performance metrics, indicating that the proposed methods enhance temporal awareness and robustness in long-form audio understanding. The inclusion of a diverse dataset and human-verified benchmarks adds credibility to the findings.
The paper includes detailed implementation details and a clear description of the training strategy, which enhances the reproducibility of the results. The authors provide access to the dataset, benchmark, and model through a GitHub repository, facilitating further research and validation of their findings by the community.
While the proposed framework shows promise, there are limitations, such as the computational overhead introduced by multi-turn reasoning and tool use, which may hinder real-time applications. Additionally, the focus on single-audio inputs limits the framework's applicability in more complex multimodal scenarios. Future work is needed to enhance efficiency and extend the framework to broader contexts.
The research has significant implications for various applications, including automated transcription, audio search engines, and multimedia content analysis. By improving long-form audio understanding, the work can enhance user experiences in domains such as education, entertainment, and accessibility for the hearing impaired. The open-source nature of the project encourages further innovation and exploration in the field of audio language processing.
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward intended targets. To address these bottlenecks, we propose a prompt-free framework decoupling acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model enforcing monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the Indirect Fusion (IF) strategy under the knowledge transfer principle. Experiments show CROTTC-IF achieves a 71.77% F1-score on L2-ARCTIC and a 71.70% F1-score on the Iqra'Eval2 leaderboard. Through empirical analysis, we demonstrate that decoupling acoustics from explicit priors yields highly robust MDD.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
The main contribution of this paper is the introduction of a prompt-free paradigm for mispronunciation detection that effectively separates acoustic fidelity from canonical bias, leading to improved diagnostic accuracy. This work significantly advances the field of MDD by addressing critical methodological challenges and demonstrating state-of-the-art performance across diverse benchmarks, thus paving the way for future research and applications in language learning and speech recognition.
The paper introduces a novel framework, CROTTC-IF, which effectively decouples acoustic fidelity from canonical guidance in Mispronunciation Detection and Diagnosis (MDD). The methodology is well-structured, incorporating a frame-wise acoustic model (CROTTC) that utilizes Optimal Temporal Transport Classification (OTTC) to capture fine-grained mispronunciation cues. Additionally, the Indirect Fusion (IF) strategy allows for implicit knowledge transfer, enhancing the model's performance without relying on explicit canonical prompts. The integration of Consistency Regularization further stabilizes predictions, showcasing a comprehensive approach to addressing the limitations of existing MDD systems.
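As a reference point for the reported F1-scores, MDD detection is usually scored over aligned (canonical, perceived, predicted) phone triples. The counting below follows the common true-rejection/false-rejection convention and is a sketch, since exact alignment and labeling details vary by benchmark:

```python
def mdd_f1(canonical, perceived, predicted):
    """Detection F1 over pre-aligned phone triples.
    TR: mispronunciation correctly flagged; FR: correct phone wrongly flagged;
    FA: mispronunciation missed."""
    tr = fr = fa = 0
    for c, p, h in zip(canonical, perceived, predicted):
        mispronounced = (p != c)   # what the learner actually said
        flagged = (h != c)         # what the system detected
        if mispronounced and flagged:
            tr += 1
        elif not mispronounced and flagged:
            fr += 1
        elif mispronounced and not flagged:
            fa += 1
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

canonical = ['a', 'b', 'c', 'd']
perceived = ['a', 'x', 'c', 'd']   # learner mispronounced 'b'
predicted = ['a', 'x', 'y', 'd']   # 'b' flagged correctly, 'c' flagged falsely
print(round(mdd_f1(canonical, perceived, predicted), 3))  # → 0.667
```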
The experimental evaluation is robust, with the authors conducting extensive tests on multiple datasets, including L2-ARCTIC and Iqra'Eval2. The reported F1-scores of 71.77% and 71.70% demonstrate competitive performance compared to state-of-the-art methods. The paper includes ablation studies that effectively highlight the contributions of different components of the proposed framework, providing a clear understanding of the impact of each method on overall performance.
The paper provides detailed implementation details, including architecture specifications, training protocols, and hyperparameter settings. However, the lack of a publicly accessible code repository limits the reproducibility of the results, as external researchers cannot easily verify or build upon the findings.
While the proposed framework shows promise, the paper does not address potential limitations regarding the generalizability of the model to spontaneous speech or other languages beyond the tested datasets. Additionally, the reliance on specific datasets may introduce biases that could affect the model's applicability in diverse real-world scenarios.
The advancements in MDD presented in this paper have significant implications for various applications, particularly in language learning and automated speech recognition. By improving the accuracy of mispronunciation detection, the framework can enhance educational tools for language learners and contribute to more effective speech therapy solutions.
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Primary: Wuhan University
All Institutions: Wuhan University, Tencent Ethereal Audio Lab, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of DM-ASR, a diarization-aware multi-speaker ASR framework that effectively combines speaker attribution and temporal grounding through a structured dialogue generation approach. This innovative methodology not only improves transcription quality but also demonstrates the potential of integrating diarization cues with large language models, marking a significant advancement in the field of automatic speech recognition.
The proposed DM-ASR framework innovatively reformulates the multi-speaker ASR task as a multi-turn dialogue generation process, effectively integrating speaker diarization cues into the transcription process. This approach decouples speaker identity and temporal information from linguistic content, allowing for a structured generation that enhances both transcription accuracy and robustness against imperfect diarization cues. The introduction of special tokens for speaker and timestamp information, alongside the optional word-level timestamp prediction, represents a significant methodological advancement in the field.
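The multi-turn decomposition can be illustrated by turning diarization output into one speaker- and time-conditioned query per segment. The prompt wording and tuple layout here are hypothetical, since the paper conditions with special tokens rather than plain-text prompts:

```python
def build_queries(segments):
    """One transcription query per diarized (start, end, speaker) segment,
    ordered by time; the query wording is illustrative only."""
    return [
        f"Transcribe speaker {spk} from {start:.2f}s to {end:.2f}s."
        for start, end, spk in sorted(segments)
    ]

segs = [(3.2, 5.0, "B"), (0.0, 2.8, "A")]
print(build_queries(segs)[0])  # → Transcribe speaker A from 0.00s to 2.80s.
```

Each query becomes one dialogue turn, so the LLM only ever solves a single-speaker, single-segment transcription sub-task at a time.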
The experiments conducted on both Mandarin and English datasets demonstrate the effectiveness of DM-ASR, achieving competitive performance with smaller models and limited training data compared to larger, more data-intensive systems. The results indicate that the framework not only outperforms traditional cascaded systems but also rivals state-of-the-art end-to-end models, showcasing the practical applicability and generalizability of the proposed method across different languages and conversational contexts.
The paper provides detailed implementation information, including the architecture of the model, training procedures, and datasets used, which enhances reproducibility. However, the lack of publicly available code or demo URLs limits the ability for others to directly replicate the findings without additional effort.
One notable limitation is the reliance on external diarization systems, which can introduce errors that affect overall performance. Additionally, while the model shows robustness against imperfect cues, it does not consistently outperform strong diarization front-ends under all conditions, indicating a potential area for improvement. The paper also does not explore the scalability of the method to larger datasets or more complex conversational scenarios.
The DM-ASR framework has significant implications for real-world applications in multi-speaker environments such as meetings, interviews, and call centers. By improving the accuracy of speaker attribution and temporal grounding in ASR systems, it could enhance accessibility for users requiring accurate transcriptions, such as those with hearing impairments. Furthermore, the integration of LLMs with diarization cues could pave the way for more advanced conversational AI systems capable of understanding and generating human-like dialogue.
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
Primary: Klangio GmbH
All Institutions: Klangio GmbH, Institute of Industrial Information Technology, Karlsruhe Institute of Technology
This paper presents a novel transformer-based approach for beat-based rhythm quantization of MIDI performances, significantly advancing the field of Automatic Music Transcription. The integration of beat annotations into the quantization process enhances the model's performance and flexibility, marking a meaningful contribution to music information retrieval.
The methodology is robust, leveraging a transformer architecture tailored for rhythm quantization by incorporating beat annotations. The preprocessing steps for aligning performance and score data are well-defined, and the tokenization scheme is innovative, allowing for efficient encoding of musical data. The model's adaptability to different time signatures and its ability to generalize across unseen time signatures are significant contributions. However, the reliance on a priori beat information may limit its applicability in scenarios where such data is not available.
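The beat-based pre-quantization idea can be sketched as snapping each performed onset to the nearest subdivision of the annotated beat grid; the subdivision count and edge handling here are illustrative choices, not the paper's exact scheme:

```python
import bisect

def quantize_to_beats(onsets, beats, subdivisions=4):
    """Snap onset times (seconds) to the nearest beat subdivision and return
    positions in fractional beats from the first annotated beat."""
    out = []
    for t in onsets:
        # locate the surrounding beat interval (clamped at the grid edges)
        i = max(1, min(bisect.bisect(beats, t), len(beats) - 1))
        b0, b1 = beats[i - 1], beats[i]
        frac = (t - b0) / (b1 - b0)                  # position inside the beat
        snapped = round(frac * subdivisions) / subdivisions
        out.append((i - 1) + snapped)
    return out

beats = [0.0, 0.5, 1.0, 1.5]                   # annotated beats at 120 BPM
print(quantize_to_beats([0.13, 0.74], beats))  # → [0.25, 1.5]
```

Expressing performance times in beat fractions this way puts score and performance in one unified time base, which is what lets a sequence model be trained on synchronized pairs.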
The experiments are comprehensive, utilizing a suitable dataset (ASAP) that includes diverse performance MIDI files. The evaluation metrics are well-chosen, focusing on onset F1-score and note value accuracy, which are critical for assessing quantization performance. The results demonstrate strong performance compared to state-of-the-art models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more extensive comparisons with a broader range of existing methods.
The paper provides sufficient details on the model architecture, training process, and evaluation metrics, which would allow other researchers to replicate the study. However, the absence of a publicly available code repository limits reproducibility.
The main limitations include the dependency on beat annotations, which may not always be available, and the model's performance on more complex time signatures that were not part of the training set. Additionally, the focus on piano and guitar data may restrict the model's generalizability to other instruments.
This work has significant implications for music information retrieval and automatic music transcription, offering a new approach to rhythm quantization that could enhance the usability of MIDI data in various applications, including music education, performance analysis, and music generation. The model's ability to generalize across different time signatures and instruments could lead to broader applications in music technology.
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of the ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system's ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.
Primary: Nanjing University
All Institutions: Nanjing University, Northwestern Polytechnical University, AISHELL
This paper presents a comprehensive study on full-duplex interaction in spoken dialogue systems, introducing a novel dataset and evaluation framework that significantly advance the field. The methodology is well-structured, and the results demonstrate the potential for developing more human-like dialogue systems, addressing key challenges in real-time conversational dynamics.
The paper introduces a dual-channel dataset that captures realistic conversational dynamics, including interruptions and overlapping speech, which is a significant advancement over existing datasets that primarily focus on single-channel recordings. The methodology for dataset construction combines LLM-generated scripts with human recordings, ensuring both authenticity and control over interaction behavior. The evaluation framework, HumDial-FDBench, is well-structured, providing clear metrics for assessing system performance in real-time dialogue scenarios. This comprehensive approach allows for a nuanced understanding of full-duplex interaction, making it a valuable resource for future research.
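One basic statistic a dual-channel corpus enables is the overlap ratio between the two speakers' voice-activity tracks; the frame-level formulation below is a generic sketch, not a metric defined by HumDial-FDBench:

```python
def overlap_ratio(vad_a, vad_b):
    """Fraction of active speech frames where both channels speak at once.
    vad_a/vad_b are equal-length 0/1 frame-level voice-activity sequences."""
    both = sum(a and b for a, b in zip(vad_a, vad_b))
    speech = sum(a or b for a, b in zip(vad_a, vad_b))
    return both / speech if speech else 0.0

# Four frames: the speakers overlap on exactly one of three speech frames.
print(round(overlap_ratio([1, 1, 0, 0], [0, 1, 1, 0]), 3))  # → 0.333
```

Statistics like this are what single-channel corpora cannot provide, since overlapping speech is collapsed into one waveform there.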
The experimental results are robust, with a clear comparison of various models' performance on the released benchmark. The paper provides detailed metrics for interruption handling, rejection behavior, and response latency, which are critical for evaluating the effectiveness of dialogue systems in real-world scenarios. The inclusion of a public leaderboard enhances the transparency and reproducibility of the results, encouraging further development in this area. However, the paper could benefit from more extensive discussion on the specific experimental setups and conditions under which the models were evaluated.
The paper emphasizes the release of a publicly available dataset and benchmark, which facilitates reproducibility. The authors provide a clear methodology for data collection and evaluation metrics, allowing other researchers to replicate their experiments. However, the lack of detailed implementation specifics for the models evaluated may hinder full reproducibility for those attempting to build upon this work.
One limitation is the potential bias in the dataset construction, as it relies on scripted dialogues performed by professional actors, which may not fully capture the variability of spontaneous human interactions. Additionally, the paper acknowledges challenges related to background noise and speaker overlap, which could affect model performance in real-world applications. The evaluation metrics primarily focus on behavioral correctness and latency, potentially overlooking other important aspects of dialogue quality.
The resources provided by this research have significant implications for the development of more natural and responsive spoken dialogue systems. By addressing the limitations of traditional turn-taking paradigms, this work paves the way for advancements in human-computer interaction, with applications in customer service, virtual assistants, and conversational agents. The emphasis on real-time interaction and the ability to handle interruptions could lead to more engaging and effective communication tools. This paper presents a comprehensive study on full-duplex interaction in spoken dialogue systems, introducing a novel dataset and evaluation framework that significantly advance the field. The methodology is well-structured, and the results demonstrate the potential for developing more human-like dialogue systems, addressing key challenges in real-time conversational dynamics.
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowledge, MAGIC-TTS is the first TTS model with explicit local timing control over token-level content duration and pause. MAGIC-TTS is enabled by explicit token-level duration conditioning, carefully prepared high-confidence duration supervision, and training mechanisms that correct zero-value bias and make the model robust to missing local controls. On our timing-control benchmark, MAGIC-TTS substantially improves token-level duration and pause following over spontaneous synthesis. Even when no timing control is provided, MAGIC-TTS maintains natural high-quality synthesis. We further evaluate practical local editing with a scenario-based benchmark covering navigation guidance, guided reading, and accessibility-oriented code reading. In this setting, MAGIC-TTS realizes a reproducible uniform-timing baseline and then moves the edited regions toward the requested local targets with low mean bias. These results show that explicit fine-grained controllability can be implemented effectively in a high-quality TTS system and can support realistic local timing-editing applications.
Primary: South China University of Technology
All Institutions: South China University of Technology
MAGIC-TTS introduces the first TTS model with explicit local timing control over token-level content duration and pause. This comprehensive analysis highlights the model's innovative approach to TTS, its rigorous methodology, and its potential to significantly impact the field of speech synthesis by improving the quality and controllability of generated speech.
The methodology presented in MAGIC-TTS is robust, leveraging a flow-based TTS backbone to achieve explicit local timing control over token-level content duration and pause. The authors introduce a novel training mechanism that incorporates high-confidence duration supervision and zero-value correction, which effectively addresses the challenges of local timing manipulation in TTS systems. The separation of timing control from the acoustic generation process is a significant improvement, allowing for precise control without compromising synthesis quality. The detailed explanation of the training data pipeline and the careful construction of timing supervision demonstrate a thorough understanding of the complexities involved in TTS systems.
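The zero-value correction and robustness to missing controls described above can be made concrete with a minimal sketch. The representation below (a per-token duration value paired with a validity flag, plus training-time control dropout) is our own illustrative assumption, not the paper's actual conditioning scheme:

```python
import random

UNSPECIFIED = -1.0  # sentinel: distinct from a true zero-length pause

def build_duration_conditioning(durations, drop_prob=0.5, rng=random):
    """Per-token duration channel plus a validity flag.

    Separating the flag from the value avoids zero-value bias: a genuine
    0 s pause (value 0.0, flag 1) is never confused with an unconditioned
    token (value UNSPECIFIED, flag 0). Randomly dropping controls during
    training makes the model robust to partially specified timing at
    inference time.
    """
    cond = []
    for d in durations:
        if d is None or rng.random() < drop_prob:
            cond.append((UNSPECIFIED, 0))   # no local control for this token
        else:
            cond.append((float(d), 1))      # explicit duration in seconds
    return cond

# Fully specified, no dropout: every token keeps its control flag.
full = build_duration_conditioning([0.12, 0.0, 0.30], drop_prob=0.0)
```

The key design point is that "no control" is encoded out-of-band rather than as a zero, which is what lets a zero survive as a meaningful target.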
The experiments are well-designed, utilizing a comprehensive timing-control benchmark to validate the effectiveness of MAGIC-TTS. The results show substantial improvements in token-level duration and pause accuracy when explicit controls are provided, with clear metrics such as mean absolute error and correlation coefficients. The ablation studies further strengthen the claims by isolating the contributions of key components, confirming the importance of zero-value correction and cross-validated timing supervision. The practical local editing scenarios also illustrate the model's versatility and real-world applicability.
The paper provides sufficient details regarding the experimental setup, including model architecture, training configurations, and evaluation protocols, which supports reproducibility. However, the absence of a publicly available demo or project URL limits the practical reproducibility of the results, as external researchers would need to replicate the entire setup from scratch.
One limitation is the reliance on high-confidence supervision, which may not be easily attainable in all datasets or languages, potentially affecting the model's generalizability. Additionally, while the paper demonstrates improvements in timing control, it does not extensively explore the impact of these improvements on user experience or subjective quality assessments in real-world applications.
The advancements in fine-grained controllability in TTS systems have significant implications for applications such as navigation guidance, accessibility tools, and interactive voice assistants. By enabling precise local timing manipulation, MAGIC-TTS can enhance the expressiveness and naturalness of synthesized speech, making it more adaptable to various contexts and user needs.
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer input and the instrument's mechanically imposed, time-varying load. Reflective optical sensors mounted beneath the distal end of each lever provide continuous displacement, timing, and articulation data without interfering with the action. Unlike existing optical systems designed for modern pianos, PHOTON accommodates the diverse geometries, limited clearances, and non-standard layouts of harpsichords, clavichords, and early fortepianos. Its modular, low-profile architecture enables high-resolution, low-latency sensing across multiple manuals and variable key counts. Beyond performance capture, PHOTON provides real-time MIDI output and supports empirical study of expressive gesture, human-instrument interaction, and the construction of instrument-specific MIDI corpora using real historical mechanisms. The complete system is released as open-source hardware and software, from schematics and PCB layouts developed in KiCad to firmware written in CircuitPython, lowering the barrier to adoption, replication, and extension.
Primary: Institute for Logic, Language, and Computation
All Institutions: Institute for Logic, Language, and Computation, University of Amsterdam
The main contribution of this paper is the introduction of the PHOTON system, a non-invasive optical tracking technology for historical keyboard instruments that facilitates detailed analysis of key-lever motion and expressive gesture. This innovative approach, combined with its open-source nature, positions PHOTON as a valuable tool for researchers and performers alike, potentially transforming the study and practice of historical keyboard music.
The methodology presented in this paper is innovative and well-structured, focusing on a non-invasive optical sensing system tailored for historical keyboard instruments. The use of reflective optical sensors to measure key-lever motion is a significant advancement over existing systems, which are primarily designed for modern pianos. The modular and low-profile design allows for high-resolution data capture while accommodating the unique geometries of historical instruments. The authors provide a thorough explanation of the hardware design, including sensor selection, calibration, and integration, which demonstrates a strong understanding of the mechanical constraints involved. The open-source nature of the project enhances its accessibility and encourages further research and development.
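As a rough model of what such sensing firmware computes (this is not the actual PHOTON CircuitPython code), the sketch below turns a normalized key-lever displacement trace into note-on/off events, deriving a MIDI-style velocity from the traversal time between two thresholds. The threshold values and velocity scaling are illustrative assumptions:

```python
def detect_note_events(samples, t_step, on_thresh=0.6, vel_thresh=0.2):
    """Turn a key-lever displacement trace into note-on/off events.

    samples: normalized displacement in [0, 1] (1 = fully depressed).
    t_step: seconds per sample. Velocity is derived from the traversal
    time between vel_thresh and on_thresh, so even hammer-less actions
    (harpsichord, clavichord) yield touch information.
    """
    events, t_start, down = [], None, False
    for i, x in enumerate(samples):
        t = i * t_step
        if not down:
            if t_start is None and x >= vel_thresh:
                t_start = t                      # key began moving
            if x >= on_thresh:
                t0 = t if t_start is None else t_start
                travel = max(t - t0, t_step)
                velocity = min(127, int(0.01 / travel * 127))  # faster = louder
                events.append(("on", t, velocity))
                down, t_start = True, None
        elif x < vel_thresh:                     # key released past hysteresis
            events.append(("off", t, 0))
            down = False
    return events
```

Using two thresholds gives hysteresis, so sensor noise near a single trigger level cannot produce spurious repeated notes.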
While the paper does not present extensive experimental results, it includes a case study that illustrates the effectiveness of the PHOTON system in capturing key-action behavior on a harpsichord. The authors provide motion traces that reveal fine-grained aspects of touch and articulation, which are crucial for understanding performance nuances. However, more comprehensive experiments comparing PHOTON with existing systems or evaluating its performance across various historical instruments would strengthen the paper's contributions.
The authors emphasize reproducibility by providing detailed schematics, PCB layouts, and firmware source code. The use of widely available components and open-source tools further supports the project's replicability. The inclusion of a custom KiCad plugin for sensor placement is particularly noteworthy, as it simplifies the adaptation of the system to different keyboard layouts.
One limitation of the study is the lack of extensive empirical validation across a broader range of historical keyboard instruments. While the case study is informative, additional data from various setups would provide a more robust evaluation of the system's capabilities. Furthermore, ethical considerations regarding unobtrusive sensing are briefly mentioned but could benefit from a more in-depth discussion.
The PHOTON system has the potential to significantly impact the fields of musicology, performance practice, and instrument design. By enabling detailed empirical studies of expressive gesture and human-instrument interaction, it opens new avenues for research that have been historically underrepresented. The integration of real-time MIDI output and the ability to create instrument-specific MIDI corpora can enhance both educational and performance contexts, making historical keyboard instruments more accessible to contemporary musicians.
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic gradient of the portamento slide, measured in Hz/second, and demonstrates its measurement using a protocol combining Sonic Visualiser's melodic spectrogram layer, GIMP pixel analysis, and metric calibration against the spectrogram's known frequency axis. The gradient captures what duration alone cannot: the steepness of the pitch trajectory, which encodes the expressive character of the slide independently of its length. Applied to the opening measures of two sonatas, chosen specifically because their monophonic texture permits reliable spectrographic pitch tracking, the method yields gradient values ranging from approximately 600 Hz/s in late-period recordings to over 4,000 Hz/s in early twentieth-century performances. The paper further documents a gain-recovery protocol that extends the analysable corpus to analogue recordings from the 1930s where portamento traces are faint in digital transfer. Applying the method to a corpus of 22 recordings spanning 1930-2012, the paper tests the hypothesis that gradient steepness correlates negatively with tempo: that slower performances produce steeper, longer slides while faster performances produce shallower slides or none at all. The results support this hypothesis, suggesting that the widely documented decline of portamento across the twentieth century is not a binary transition from presence to absence but a continuous shallowing of slide gradients.
Primary: unknown
All Institutions: unknown
This paper introduces a new quantitative descriptor for portamento in string performance, significantly enhancing the analysis of expressive techniques in historical recordings. The innovative methodology and empirical findings provide valuable insights into the evolution of musical expression, making a meaningful contribution to the fields of musicology and audio analysis.
The paper introduces a novel methodology for measuring portamento in string performance through a spectrographic gradient, which is a significant advancement over existing binary measures of portamento presence and duration. The combination of Sonic Visualiser for spectrogram analysis and GIMP for pixel analysis is innovative, allowing for a more nuanced understanding of musical expressiveness. The calibration of the gradient measurement to physical units (Hz/second) adds rigor and comparability to the findings.
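The calibration step reduces to straightforward arithmetic once the axis scales are known. The sketch below converts pixel coordinates of a slide's endpoints into Hz/s; the function name and the example constants (Hz per pixel, seconds per pixel) are hypothetical, since the paper's constants are specific to its spectrogram settings:

```python
def slide_gradient_hz_per_s(p_start, p_end, hz_per_px, s_per_px):
    """Spectrographic gradient of a portamento slide.

    p_start, p_end: (x_px, y_px) pixel coordinates of the slide's
    endpoints, as measured in an image editor such as GIMP. hz_per_px
    and s_per_px calibrate the spectrogram's frequency and time axes;
    both depend on the spectrogram settings and must be re-derived for
    each analysis configuration.
    """
    (x0, y0), (x1, y1) = p_start, p_end
    dt = (x1 - x0) * s_per_px
    if dt == 0:
        raise ValueError("slide has zero duration")
    df = (y0 - y1) * hz_per_px   # image y grows downward, pitch grows upward
    return df / dt

# Example: a 40 px rise over 120 px, with 5 Hz/px and 0.0025 s/px:
# (40 * 5) / (120 * 0.0025) = 200 Hz over 0.3 s, roughly 667 Hz/s
g = slide_gradient_hz_per_s((0, 140), (120, 100), hz_per_px=5.0, s_per_px=0.0025)
```

Because the sign convention flips the image's y axis, ascending and descending slides come out with opposite signs, which the analysis can keep or discard as needed.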
The experiments are well-structured, utilizing a corpus of 22 recordings spanning over eight decades. The analysis of gradient values and their correlation with tempo provides empirical support for the paper's hypotheses. The use of historical recordings adds depth to the findings, showing a continuous decline in portamento expressiveness rather than a simple absence.
The methodology is detailed, with clear steps for measurement and calibration, which should allow for reproducibility by other researchers. However, the reliance on human judgment in placing reference points for gradient measurement introduces variability that could affect reproducibility.
The study is limited to specific passages of two sonatas, which may not generalize across the entire cello repertoire. Additionally, the subjective nature of reference point placement could lead to inconsistencies in gradient measurement. The calibration constants are also specific to the settings used, which may limit comparisons with other studies.
This research has the potential to influence both musicology and performance practice by providing a quantitative framework for analyzing expressive techniques in string performance. The findings could inform teaching practices and performance interpretations, as well as contribute to the broader understanding of stylistic evolution in music.
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in conference rooms, lecture halls, and interviews. We present EchoMask, the first practical physical-layer system for real-time voiceprint anonymization using acoustic metamaterials. By modifying sound waves before they reach the microphone, EchoMask prevents attackers from capturing clean voiceprints through compromised devices. Our design combines three key innovations: frequency-selective interference to disrupt voiceprint features while preserving speech intelligibility, an acoustic-field model to ensure stability under speaker movement, and reconfigurable structures that create time-varying interference to prevent learning or canceling a fixed acoustic pattern. EchoMask is low-cost, power-free, and 3D-printable, requiring no machine learning, software support, or microphone modification. Experiments conducted across eight microphones in diverse environments demonstrate that EchoMask increases the Miss-match Rate, i.e., the fraction of failed voiceprint matching attempts, to over 90%, while maintaining high speech intelligibility.
Primary: Northwest University
All Institutions: Northwest University, University of Leeds
This paper presents a pioneering approach to voiceprint anonymization using acoustic metamaterials, addressing critical challenges in real-time applications while maintaining speech intelligibility. The combination of innovative design principles and thorough experimental validation positions this work as a significant contribution to the field of audio privacy and security.
The methodology presented in this paper is innovative, leveraging acoustic metamaterials for voiceprint anonymization in real-time scenarios. The authors effectively address three critical challenges: maintaining speech intelligibility while disrupting identity cues, ensuring stability under speaker movement, and preventing predictable acoustic patterns. The design principles are well-structured, focusing on targeted low-frequency perturbation, dynamic stability, and passive randomization, which collectively enhance the robustness of the system. The use of numerical simulations and physical experimentation to validate the design is commendable, although the lack of machine learning integration may limit adaptability in some contexts.
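Although EchoMask itself is a passive physical structure, the frequency-selective principle can be illustrated in software: inject interference only into a band assumed to carry speaker-identity cues, leaving everything outside it untouched. The band edges, tone count, and synthesis method below are illustrative assumptions, not the paper's design:

```python
import math
import random

def band_limited_interference(duration_s, sr, f_lo, f_hi, n_tones=8, seed=0):
    """Random-phase sinusoids confined to [f_lo, f_hi] Hz.

    A software stand-in for frequency-selective acoustic interference:
    energy appears only inside the chosen band, so spectral content
    outside it (and hence intelligibility cues there) is preserved.
    """
    rng = random.Random(seed)
    tones = [(rng.uniform(f_lo, f_hi), rng.uniform(0.0, 2.0 * math.pi))
             for _ in range(n_tones)]
    n = int(duration_s * sr)
    return [sum(math.sin(2.0 * math.pi * f * i / sr + p) for f, p in tones) / n_tones
            for i in range(n)]

# 50 ms of interference confined to a hypothetical 80-300 Hz identity band.
mask = band_limited_interference(0.05, 16_000, 80.0, 300.0)
```

Varying the seed over time gives a crude analogue of the paper's time-varying interference, which prevents an attacker from learning and subtracting a fixed pattern.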
The experiments are comprehensive, evaluating the system across various microphones and real-world conditions. The results demonstrate a high Miss-match Rate (MMR) of over 90%, indicating effective voiceprint protection while maintaining speech intelligibility. The inclusion of subjective listening tests (Mean Opinion Score) further strengthens the evaluation by providing insights into perceived audio quality. However, the paper could benefit from a more detailed breakdown of the experimental setup and conditions to enhance transparency.
While the paper provides a solid theoretical foundation and experimental results, it lacks specific implementation details that would facilitate reproducibility. Key parameters, such as the exact configurations of the metamaterials and the experimental setups, are not thoroughly documented. Additionally, the absence of a project URL or code repository limits the ability of other researchers to replicate the work.
The primary limitations include the reliance on passive metamaterials, which may restrict adaptability to varying acoustic environments and speaker dynamics. The system's performance under extreme conditions (e.g., very high noise levels or rapid speaker movement) is not fully explored. Furthermore, while the approach is innovative, it does not incorporate machine learning techniques that could enhance performance through adaptive learning.
The implications of this research are significant, particularly in enhancing privacy and security in voice-based authentication systems. The ability to anonymize voiceprints in real-time without requiring modifications to existing devices opens up new avenues for protecting users in public and shared environments. The findings could influence future designs of microphones and voice interaction systems, promoting user privacy in increasingly digital and interconnected spaces.
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
Primary: Renmin University of China
All Institutions: Renmin University of China
The paper presents a novel approach to audio-text interleaved contextual retrieval, introducing the ATIR task and a benchmark that significantly enhances the capabilities of existing retrieval systems. The comprehensive methodology, innovative technical contributions, and thorough experimental validation position this work as a meaningful advancement in the field of multimodal information retrieval.
The methodology presented in the paper is robust, introducing the ATIR task and a comprehensive benchmark that addresses the limitations of existing audio retrieval datasets. The novel token compression mechanism and the bi-encoder architecture with a token selector module are innovative contributions that enhance the performance of interleaved audio-text retrieval. The synthesis pipeline for data generation is well-structured, ensuring high-quality multimodal data that is critical for training effective models.
The experimental evaluation is thorough, demonstrating significant improvements over strong baselines across various retrieval settings. The use of multiple metrics (Recall@k and nDCG@k) provides a comprehensive assessment of model performance. The ablation studies effectively validate the contributions of the proposed components, particularly the token selector's impact on retrieval efficiency and accuracy.
The paper provides detailed implementation information, including model architecture, training configurations, and hyperparameters, which supports reproducibility. However, the lack of a publicly available project or demo URL limits accessibility for other researchers wishing to replicate the results.
The paper acknowledges limitations, such as the focus on single-document retrieval and the potential for future exploration of more complex retrieval scenarios. Additionally, the lightweight representation design may restrict performance in certain contexts, and the evaluation is primarily centered on QA-centric tasks, leaving broader applications untested.
The introduction of the ATIR task and benchmark has the potential to significantly influence multimodal retrieval research, particularly in applications involving conversational agents and hybrid search systems. The findings could lead to advancements in how audio and text are integrated for more effective information retrieval systems.
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.
Primary: FindLab
All Institutions: FindLab
The paper introduces a novel two-stage OMR approach that effectively decodes complex polyphonic music into structured formats, significantly advancing the field of music recognition. The methodology leverages innovative techniques to address longstanding challenges in music transcription, with implications for both practical applications and future research directions.
The paper presents a two-stage Optical Music Recognition (OMR) pipeline that innovatively formulates the second stage as a structure decoding problem. The use of topology recognition with a probability-guided search (BeadSolver) is a significant methodological advancement, addressing the complex challenges of voice separation and timing in polyphonic music. The integration of procedural generation with recognition-feedback annotations for training data further enhances the robustness of the proposed method.
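BeadSolver's internals are not spelled out here, but the general idea of probability-guided search over voice assignments can be illustrated generically. The Gaussian pitch-proximity score and beam width below are our own assumptions for a toy voice-separation problem, not the paper's actual model:

```python
def assign_voices(notes, n_voices, pitch_sigma=7.0, beam_width=16):
    """Beam search over voice assignments for notes in onset order.

    Each hypothesis assigns every note so far to one of n_voices and is
    scored by a pitch-proximity model that prefers small melodic leaps
    within a voice. Keeping a beam of top hypotheses, rather than a
    single greedy choice, lets early ambiguities be resolved by later
    evidence.

    notes: MIDI pitch numbers. Returns the per-note voice indices of the
    best-scoring hypothesis.
    """
    # (log-probability, assignment so far, last pitch seen in each voice)
    beam = [(0.0, [], [None] * n_voices)]
    for pitch in notes:
        candidates = []
        for score, assign, last in beam:
            for v in range(n_voices):
                d = 0.0 if last[v] is None else abs(pitch - last[v])
                logp = -(d / pitch_sigma) ** 2      # Gaussian leap penalty
                new_last = list(last)
                new_last[v] = pitch
                candidates.append((score + logp, assign + [v], new_last))
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]
```

On two interleaved melodic lines, the search untangles them into consistent voices even though the input is a single onset-ordered stream.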
The experiments are well-structured, comparing the proposed BeadSolver against rule-based and linear-equations baselines. The results demonstrate clear improvements in the quality of the structured output, indicating that the proposed method effectively addresses the limitations of existing approaches. However, specific quantitative results and metrics used for evaluation could be more explicitly detailed to strengthen the findings.
The paper outlines the methodology and provides a clear description of the data pipeline and model architecture, which aids in reproducibility. However, the absence of publicly available code or datasets limits the ability to fully replicate the results.
The paper does not address potential limitations in handling highly variable music notations or the scalability of the proposed method to broader music genres beyond piano scores. Additionally, the reliance on procedural generation for training data may introduce biases that are not fully explored.
The proposed OMR system has the potential to significantly enhance the accessibility of historical and contemporary music scores, enabling better integration into digital music platforms and educational tools. This could foster greater engagement with music education and preservation efforts.
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate only weakly with perceptual audio quality ratings from listening tests, which are considered the gold-standard evaluation method. As an alternative approach in singing voice separation, embedding-based intrusive metrics have been introduced that leverage latent representations from large self-supervised audio models such as Music undERstanding with large-scale self-supervised Training (MERT). In this work, we analyze the correlation of perceptual audio quality ratings with two intrusive embedding-based metrics: a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD), both calculated on MERT embeddings. Experiments on two independent datasets show that these metrics correlate more strongly with perceptual audio quality ratings than traditional BSS-Eval metrics across all analyzed stem and model types.
Primary: University of Music and Performing Arts Graz
All Institutions: University of Music and Performing Arts Graz
The main contribution of this paper is the introduction of embedding-based intrusive evaluation metrics for musical source separation, which demonstrate stronger correlations with perceptual audio quality ratings than traditional BSS-Eval metrics. This work significantly advances the evaluation methodologies in the field, providing a more perceptually relevant framework for assessing audio separation models.
The paper introduces a novel approach to evaluate musical source separation (MSS) using embedding-based intrusive metrics derived from MERT representations. The methodology is well-structured, leveraging self-supervised audio models to compute metrics that correlate better with human perceptual ratings compared to traditional BSS-Eval metrics. The use of two specific metrics (MSE and an intrusive variant of FAD) is innovative, and the paper provides a clear explanation of how these metrics are calculated and their significance in the context of MSS evaluation.
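To make the two metrics concrete, here is a minimal sketch. The diagonal-covariance simplification is ours, the published FAD uses full covariance matrices and a matrix square root, and the embeddings here are plain lists rather than MERT outputs:

```python
import math

def mse_metric(emb_ref, emb_est):
    """Mean squared error between paired reference/estimate embeddings."""
    n = sum(len(r) for r in emb_ref)
    return sum((a - b) ** 2
               for r, e in zip(emb_ref, emb_est)
               for a, b in zip(r, e)) / n

def diagonal_fad(emb_ref, emb_est):
    """Frechet distance between Gaussians fit to the two embedding sets,
    with covariances restricted to their diagonals. Per dimension this
    reduces to (mu1 - mu2)^2 + (sigma1 - sigma2)^2; treat it as a
    simplified stand-in for the full-covariance FAD.
    """
    def stats(embs, d):
        vals = [e[d] for e in embs]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        return mu, math.sqrt(var)

    total = 0.0
    for d in range(len(emb_ref[0])):
        mu1, s1 = stats(emb_ref, d)
        mu2, s2 = stats(emb_est, d)
        total += (mu1 - mu2) ** 2 + (s1 - s2) ** 2
    return total
```

The practical difference: MSE compares embeddings frame by frame (strictly intrusive), while FAD compares the two sets' distributions, so it tolerates small temporal misalignments between reference and estimate.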
The experiments are robust, utilizing two independent datasets (Bake-Off and GenSVS) to validate the proposed metrics. The correlation analysis conducted using Spearman's rank correlation coefficient (SRCC) and Pearson's correlation coefficient (PCC) is appropriate and effectively demonstrates the superiority of the embedding-based metrics over traditional methods. The results are well-presented, with clear tables and figures that summarize the findings.
The paper provides sufficient detail about the datasets and the implementation of the metrics, including references to the Python packages used. However, the absence of direct access to the datasets limits full reproducibility for external researchers. The code repository linked in the paper enhances reproducibility for the proposed metrics and analyses.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of musical sources encountered in real-world applications. Additionally, while the proposed metrics show improved correlation with perceptual ratings, the paper does not explore their performance across a broader range of audio genres or separation tasks.
The findings have significant implications for the field of audio processing and music technology, as they suggest a more reliable evaluation framework for MSS models. This could lead to improved development and assessment of audio separation technologies, benefiting applications in music production, audio restoration, and content creation. The approach could also inspire further research into embedding-based evaluation metrics in other audio-related tasks.
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.
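The scale of such reductions follows from simple bitrate arithmetic. The parameters below are illustrative assumptions, not Sema's actual tokenizer configuration, so the resulting ratio differs from the paper's reported 64x:

```python
def compression_ratio(sr_hz, bit_depth, tokens_per_s, bits_per_token):
    """Raw PCM bitrate divided by discrete-token bitrate."""
    raw_bps = sr_hz * bit_depth                 # e.g. 16 kHz * 16 bit = 256 kb/s
    token_bps = tokens_per_s * bits_per_token   # e.g. 50 tok/s * 11 bit = 550 b/s
    return raw_bps / token_bps

# Hypothetical 16 kHz/16-bit mono PCM vs. a 50 tok/s tokenizer with a
# 2048-entry codebook (11 bits per index):
ratio = compression_ratio(16_000, 16, 50, 11)
```

The realized ratio depends on tokenizer rate, codebook size, and framing overhead, which is why reported figures vary with configuration.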
Primary: Unaffiliated
All Institutions: Unaffiliated, Pine AI
The paper presents Sema, a semantic transport system that significantly reduces bandwidth requirements for real-time multimodal agents while maintaining task accuracy. The innovative approach and strong experimental results position this work as a meaningful contribution to the field of machine learning, particularly in audio and multimodal communication contexts.
The methodology presented in the paper introduces a novel semantic transport system, Sema, which shifts the focus from traditional signal fidelity to semantic meaning preservation. The authors effectively combine discrete audio tokenization with a hybrid screen representation, optimizing for real-time multimodal agent communication. The approach is well-structured, leveraging existing technologies in a new context, and the design principles are clearly articulated. However, the paper could benefit from a more detailed exploration of the implementation specifics and potential integration challenges with existing systems.
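The hybrid screen representation described above can be pictured as a fallback policy: send the lossless accessibility-tree text when the platform exposes one, otherwise fall back to OCR text plus compact visual tokens. A minimal sketch of that selection logic, with hypothetical names (`ScreenPayload`, `encode_screen`) that are not from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ScreenPayload:
    text: str                       # a11y-tree dump or OCR transcript
    visual_tokens: List[int] = field(default_factory=list)  # compact tokens for non-textual regions
    source: str = "a11y"            # which textual channel was used

def encode_screen(a11y_tree: Optional[str], ocr_text: str,
                  visual_tokens: List[int]) -> ScreenPayload:
    """Prefer the lossless accessibility tree; fall back to OCR text
    plus compact visual tokens when no tree is exposed."""
    if a11y_tree is not None:
        return ScreenPayload(a11y_tree, visual_tokens, source="a11y")
    return ScreenPayload(ocr_text, visual_tokens, source="ocr")
```

The design point this illustrates is that text channels are cheap and lossless where available, so pixels only need to cover what text cannot express.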
The experimental evaluation is robust, using simulations under emulated WAN conditions to demonstrate large reductions in uplink bandwidth for both audio and screenshots while keeping task accuracy within a fraction of a percentage point of the raw baseline. The results are compelling and show the system's effectiveness in practical scenarios; however, the reliance on simulation rather than real-world deployment limits the generalizability of the findings.
The paper lacks sufficient implementation details that would facilitate reproducibility. While the authors describe their methods and evaluations, the absence of a publicly available codebase or detailed algorithmic descriptions hinders other researchers from replicating the study.
The primary limitations include the lack of real-world testing, which raises questions about the performance of the system in diverse network conditions. Additionally, the paper does not address potential challenges in integrating the proposed system with existing multimodal agent architectures, which could affect its adoption.
The implications of this work are significant, as it addresses a critical bottleneck in multimodal agent communication by optimizing data transport for AI models rather than human users. This could lead to more efficient and responsive AI systems, enhancing applications in various domains such as virtual assistants, gaming, and remote collaboration tools.