Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multiple-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.
Primary: College of Intelligence and Computing, Tianjin University
All Institutions: College of Intelligence and Computing, Tianjin University, ASUS Intelligent Cloud Services
The paper presents HalluAudio, a comprehensive benchmark for evaluating hallucination detection in Large Audio-Language Models, addressing a critical gap in the audio domain. The innovative methodology and thorough experimental evaluation contribute significantly to the understanding of model behavior, making it a valuable resource for future research in audio processing and machine learning.
The methodology is robust, featuring a systematic approach to constructing the HalluAudio benchmark, which includes a five-step pipeline for data collection and validation. The use of adversarial prompts and mixed-audio conditions to induce hallucinations is particularly innovative, allowing for a nuanced exploration of model behavior across various audio tasks. The incorporation of multiple task types and the detailed analysis of hallucination behaviors add depth to the evaluation process.
The experimental evaluation is comprehensive, benchmarking a wide range of LALMs across three audio domains with over 5,000 human-verified QA pairs. The results reveal significant deficiencies in current models, highlighting specific failure modes such as acoustic grounding and temporal reasoning. The analysis of Yes/No bias and false refusal rates provides valuable insights into model behavior beyond mere accuracy, making the findings relevant and actionable for future research.
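As a rough illustration of how such behavioral metrics can be derived from binary-judgment results, the minimal Python sketch below computes accuracy, hallucination rate, yes/no bias, and refusal rate; the record fields and the refusal convention are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical sketch of diagnostic metrics for binary-judgment audio QA:
# accuracy, hallucination rate, yes/no bias, and refusal rate.
def binary_judgment_metrics(records):
    """records: list of dicts with 'gold' in {'yes','no'} and 'pred' in {'yes','no','refuse'}."""
    total = len(records)
    answered = [r for r in records if r["pred"] != "refuse"]
    correct = sum(r["pred"] == r["gold"] for r in answered)
    yes_preds = sum(r["pred"] == "yes" for r in answered)
    yes_gold = sum(r["gold"] == "yes" for r in records)
    return {
        "accuracy": correct / max(len(answered), 1),
        "hallucination_rate": 1.0 - correct / max(len(answered), 1),
        # Positive values mean the model says "yes" more often than the gold labels do.
        "yes_bias": yes_preds / max(len(answered), 1) - yes_gold / max(total, 1),
        "refusal_rate": (total - len(answered)) / max(total, 1),
    }

example = [
    {"gold": "no", "pred": "yes"},
    {"gold": "yes", "pred": "yes"},
    {"gold": "no", "pred": "refuse"},
]
print(binary_judgment_metrics(example))
```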
The paper outlines a clear methodology for dataset construction and evaluation, which enhances reproducibility. The use of a human-in-the-loop approach for validation ensures high-quality data, and the detailed description of the evaluation metrics allows for replication of the experiments. However, the actual implementation details of the models evaluated are not provided, which could hinder complete reproducibility.
One limitation is the potential bias introduced by the specific audio clips selected for the benchmark, which may not represent the full diversity of audio scenarios encountered in real-world applications. Additionally, while the benchmark is comprehensive, it may not cover all possible hallucination scenarios, leaving some gaps in the evaluation of LALMs. The reliance on human verification, while ensuring quality, may also introduce subjectivity.
The introduction of HalluAudio has the potential to significantly impact the development of more reliable LALMs by providing a standardized framework for evaluating hallucination behaviors. This benchmark could guide researchers in identifying and addressing the limitations of current models, ultimately leading to improvements in audio understanding and reasoning capabilities in practical applications.
Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
Primary: Tianjin University
All Institutions: Tianjin University, Kuaishou Technology, Institute of Automation, Chinese Academy of Sciences
UniSonate presents a unified framework for audio generation that synthesizes speech, music, and sound effects through a novel natural language interface. The technical contributions, including dynamic token injection and a multi-stage curriculum learning strategy, significantly advance the field of generative audio modeling, offering a comprehensive solution to the challenges of multimodal audio synthesis.
The methodology proposed in UniSonate is innovative, introducing a unified flow-matching framework that integrates speech, music, and sound effect generation through a natural language interface. The dynamic token injection mechanism is particularly noteworthy as it allows unstructured sound effects to be processed in a structured manner, enabling precise control over audio generation. This is complemented by a multi-stage curriculum learning strategy that effectively mitigates optimization conflicts, showcasing a thoughtful approach to training across diverse audio modalities.
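A minimal sketch of what a dynamic-token-injection step of this kind might look like, assuming a variable-length sound embedding is projected and resampled onto a fixed, duration-controlled token grid; shapes, module names, and the interpolation choice are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: project an unstructured sound embedding into a fixed
# number of temporal latent tokens so it can share a duration-controlled
# timeline with phoneme tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTokenInjector(nn.Module):
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, latent_dim)

    def forward(self, sound_emb, target_len):
        # sound_emb: (B, T_src, in_dim); target_len: number of latent frames to emit
        x = self.proj(sound_emb).transpose(1, 2)                 # (B, latent_dim, T_src)
        x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
        return x.transpose(1, 2)                                 # (B, target_len, latent_dim)

tokens = DynamicTokenInjector(512, 256)(torch.randn(2, 37, 512), target_len=100)
print(tokens.shape)  # torch.Size([2, 100, 256])
```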
The experimental evaluation is robust, with extensive comparisons against state-of-the-art models in TTS, TTM, and TTA. The paper presents clear metrics for performance evaluation, including WER, SongEval scores, and subjective evaluations like MOS. The results indicate that UniSonate achieves state-of-the-art performance in TTS and TTM while maintaining competitive fidelity in TTA, demonstrating the effectiveness of the proposed methods.
The paper provides a comprehensive description of the model architecture, training procedures, and datasets used, which supports reproducibility. However, the lack of a public code repository may hinder independent verification of results. The authors do mention the use of specific hardware configurations and hyperparameters, which aids in understanding the implementation details.
The paper acknowledges limitations, particularly in the sound effect generation where performance lags behind specialized models. Additionally, challenges in generating long-form audio content and the inherent ambiguity in natural language instructions are highlighted. These limitations suggest areas for future research and improvement.
The potential applications of UniSonate are significant, as it paves the way for general-purpose audio generation systems that can synthesize complex auditory scenes. However, ethical considerations regarding the misuse of generated audio, biases in training data, and copyright issues in music generation are critical and warrant careful attention.
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge this gap, we present ATRIE, a unified framework utilizing a Persona-Prosody Dual-Track (P2-DT) architecture. Our system disentangles generation into a static Timbre Track (via Scalar Quantization) and a dynamic Prosody Track (via Hierarchical Flow-Matching), distilled from a 14B LLM teacher. This design enables robust identity preservation (Zero-Shot Speaker Verification EER: 0.04) and rich emotional expression. Evaluated on our extended AnimeTTS-Bench (50 characters), ATRIE achieves state-of-the-art performance in both generation and cross-modal retrieval (mAP: 0.75), establishing a new paradigm for persona-driven multimedia content creation.
Primary: Guangdong University of Technology
All Institutions: Guangdong University of Technology, South China University of Technology
ATRIE presents a novel framework for high-fidelity, character-consistent voice synthesis that bridges semantic understanding and acoustic realization. The integration of LLM-guided emotional reasoning with a lightweight adapter represents a significant advancement in TTS technology, with potential applications across multiple domains.
The methodology presented in ATRIE is innovative, leveraging a dual-track architecture that separates static timbre from dynamic prosody, which is a significant advancement in the field of persona-driven speech synthesis. The use of a large language model (LLM) for distilling emotional reasoning into a lightweight adapter is particularly noteworthy, as it allows for real-time inference without the computational burden of the LLM during synthesis. The contrastive persona alignment mechanism is a clever approach to ensure character identity preservation while allowing for emotional variability. Overall, the proposed methods are well-structured and address critical challenges in TTS synthesis.
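As a hedged illustration of a contrastive persona-alignment objective of this kind, the sketch below uses a generic InfoNCE loss between utterance and persona embeddings; this is an assumed formulation, not ATRIE's actual loss.

```python
# Hypothetical sketch: pull an utterance embedding toward its character's
# persona embedding and push it away from other characters in the batch.
import torch
import torch.nn.functional as F

def persona_alignment_loss(utt_emb, persona_emb, temperature=0.07):
    # utt_emb, persona_emb: (B, D); row i of each belongs to the same character
    utt = F.normalize(utt_emb, dim=-1)
    per = F.normalize(persona_emb, dim=-1)
    logits = utt @ per.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(utt.size(0), device=utt.device)
    return F.cross_entropy(logits, targets)              # diagonal entries are positives

loss = persona_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```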
The experimental evaluation is robust, utilizing a newly established benchmark, AnimeTTS-Bench, which includes a diverse set of characters and strict zero-shot protocols. The paper reports state-of-the-art results across multiple metrics, including character consistency and emotional expression accuracy, demonstrating the effectiveness of ATRIE compared to existing systems. The inclusion of both qualitative and quantitative analyses strengthens the findings, providing a comprehensive view of the system's performance.
The paper provides sufficient details regarding the implementation of ATRIE, including hyperparameters, training protocols, and evaluation metrics. However, the absence of a publicly available demo or project URL limits the ability for independent verification of results. The authors do mention using PyTorch and provide a clear description of the architecture, which aids in reproducibility.
While ATRIE shows strong performance, there are limitations noted in the paper, such as the potential latency introduced by the LLM during inference and the reliance on a well-curated reference library for optimal performance. The model's performance may degrade for characters with limited voice data, and the system's effectiveness in languages other than Japanese remains untested.
The implications of ATRIE are significant, particularly in entertainment, accessibility, and education. By enabling consistent and emotionally expressive voice synthesis for virtual characters, the technology can enhance user engagement in various applications. However, ethical considerations regarding voice cloning and misinformation are crucial, and the authors advocate for responsible usage and detection mechanisms.
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these limitations to the lack of data, benchmarks, and modeling approaches tailored for long-form temporal awareness. To bridge this gap, we first construct LAT-Chronicle, a 1.2k-hour long-form audio dataset with temporal annotations across real-world scenarios. We further develop LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes while covering three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption. Leveraging these resources, we propose LAT-Audio, formulating temporal awareness as a progressive global-to-local reasoning paradigm. A global timeline is first constructed as an aligned temporal-semantic context, and the Think-With-Audio Chain-of-Thought (TWA-CoT) is then introduced to perform iterative reasoning by incorporating local audio information via tool use. Experiments show that LAT-Audio surpasses existing models on long-form audio temporal awareness tasks and improves robustness to input duration. We release the dataset, benchmark, and model to facilitate future research at https://github.com/alanshaoTT/LAT-Audio-Repo.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Independent Researcher
The main contribution of this paper is the introduction of a novel framework and dataset for improving temporal awareness in long-form audio understanding, which significantly advances the state of the art in audio language models. The comprehensive methodology, robust experimental validation, and potential applications underscore its significance in the field of machine learning and audio processing.
The paper presents a comprehensive methodology that addresses the limitations of existing Large Audio Language Models (LALMs) in handling long-form audio. The authors construct a new dataset (LAT-Chronicle) and benchmark (LAT-Bench) specifically designed for Long-form Audio Temporal Awareness (LATA) tasks, which include Dense Audio Captioning, Temporal Audio Grounding, and Targeted Audio Captioning. The proposed LAT-Audio framework introduces a novel global-to-local reasoning paradigm and the Think-With-Audio Chain-of-Thought (TWA-CoT) approach, which iteratively refines audio understanding by leveraging local audio segments based on a constructed global timeline. This innovative approach is well-justified and effectively addresses the challenges posed by long-form audio inputs.
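The global-to-local loop could be organized roughly as follows; this is a schematic sketch with placeholder model interfaces (build_global_timeline, reason, describe_segment, finalize), not LAT-Audio's actual API.

```python
# Schematic sketch: build a coarse global timeline first, then let the model
# iteratively request local audio windows as tool calls before answering.
def answer_long_audio(audio, sr, question, model, max_steps=4):
    """audio: waveform array at sample rate sr; model: placeholder LALM wrapper."""
    timeline = model.build_global_timeline(audio, sr)        # coarse temporal-semantic context
    context = [f"Timeline: {timeline}", f"Question: {question}"]
    for _ in range(max_steps):
        step = model.reason(context)                         # e.g. {"type": "tool_call", "start": 312.0, "end": 340.0}
        if step["type"] == "tool_call":
            lo, hi = int(step["start"] * sr), int(step["end"] * sr)
            context.append(model.describe_segment(audio[lo:hi], sr))   # local evidence fed back in
        else:                                                 # {"type": "answer", "text": ...}
            return step["text"]
    return model.finalize(context)
```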
The experimental evaluation is robust, demonstrating the effectiveness of LAT-Audio against existing models across multiple tasks. The authors provide thorough comparisons with baseline models and conduct ablation studies to validate the importance of key components such as the global timeline and TWA-CoT. The results show significant improvements in performance metrics, indicating that the proposed methods enhance temporal awareness and robustness in long-form audio understanding. The inclusion of a diverse dataset and human-verified benchmarks adds credibility to the findings.
The paper includes detailed implementation details and a clear description of the training strategy, which enhances the reproducibility of the results. The authors provide access to the dataset, benchmark, and model through a GitHub repository, facilitating further research and validation of their findings by the community.
While the proposed framework shows promise, there are limitations, such as the computational overhead introduced by multi-turn reasoning and tool use, which may hinder real-time applications. Additionally, the focus on single-audio inputs limits the framework's applicability in more complex multimodal scenarios. Future work is needed to enhance efficiency and extend the framework to broader contexts.
The research has significant implications for various applications, including automated transcription, audio search engines, and multimedia content analysis. By improving long-form audio understanding, the work can enhance user experiences in domains such as education, entertainment, and accessibility for the hearing impaired. The open-source nature of the project encourages further innovation and exploration in the field of audio language processing.
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward intended targets. To address these bottlenecks, we propose a prompt-free framework decoupling acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model enforcing monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the Indirect Fusion (IF) strategy under the knowledge-transfer principle. Experiments show CROTTC-IF achieves a 71.77% F1-score on L2-ARCTIC and 71.70% on the Iqra'Eval2 leaderboard. Through empirical analysis, we demonstrate that decoupling acoustics from explicit priors provides highly robust MDD.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
The main contribution of this paper is the introduction of a prompt-free paradigm for mispronunciation detection that effectively separates acoustic fidelity from canonical bias, leading to improved diagnostic accuracy. This work significantly advances the field of MDD by addressing critical methodological challenges and demonstrating state-of-the-art performance across diverse benchmarks, thus paving the way for future research and applications in language learning and speech recognition.
The paper introduces a novel framework, CROTTC-IF, which effectively decouples acoustic fidelity from canonical guidance in Mispronunciation Detection and Diagnosis (MDD). The methodology is well-structured, incorporating a frame-wise acoustic model (CROTTC) that utilizes Optimal Temporal Transport Classification (OTTC) to capture fine-grained mispronunciation cues. Additionally, the Indirect Fusion (IF) strategy allows for implicit knowledge transfer, enhancing the model's performance without relying on explicit canonical prompts. The integration of Consistency Regularization further stabilizes predictions, showcasing a comprehensive approach to addressing the limitations of existing MDD systems.
The experimental evaluation is robust, with the authors conducting extensive tests on multiple datasets, including L2-ARCTIC and Iqra'Eval2. The reported F1-scores of 71.77% and 71.70% demonstrate competitive performance compared to state-of-the-art methods. The paper includes ablation studies that effectively highlight the contributions of different components of the proposed framework, providing a clear understanding of the impact of each method on overall performance.
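For context, phone-level MDD F1 scores such as those reported are conventionally computed from (canonical, annotated, predicted) triples; the sketch below shows the standard counting, though the exact alignment and scoring rules used by CROTTC-IF may differ.

```python
# Standard MDD scoring sketch: a mispronunciation is a position where the
# annotated (actually spoken) phone differs from the canonical phone; the
# system "flags" a position when its prediction differs from the canonical phone.
def mdd_f1(triples):
    """triples: iterable of (canonical, annotated, predicted) phone labels."""
    tr = fr = fa = 0
    for canon, annot, pred in triples:
        mispronounced = annot != canon
        flagged = pred != canon
        if mispronounced and flagged:
            tr += 1          # true rejection: real error, detected
        elif not mispronounced and flagged:
            fr += 1          # false rejection: correct phone flagged as error
        elif mispronounced and not flagged:
            fa += 1          # false acceptance: real error missed
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(mdd_f1([("AE", "AE", "AE"), ("AE", "AH", "AH"), ("T", "D", "T")]))
```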
The paper provides detailed implementation details, including architecture specifications, training protocols, and hyperparameter settings. However, the lack of a publicly accessible code repository limits the reproducibility of the results, as external researchers cannot easily verify or build upon the findings.
While the proposed framework shows promise, the paper does not address potential limitations regarding the generalizability of the model to spontaneous speech or other languages beyond the tested datasets. Additionally, the reliance on specific datasets may introduce biases that could affect the model's applicability in diverse real-world scenarios.
The advancements in MDD presented in this paper have significant implications for various applications, particularly in language learning and automated speech recognition. By improving the accuracy of mispronunciation detection, the framework can enhance educational tools for language learners and contribute to more effective speech therapy solutions.
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Primary: Wuhan University
All Institutions: Wuhan University, Tencent Ethereal Audio Lab, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of DM-ASR, a diarization-aware multi-speaker ASR framework that effectively combines speaker attribution and temporal grounding through a structured dialogue generation approach. This innovative methodology not only improves transcription quality but also demonstrates the potential of integrating diarization cues with large language models, marking a significant advancement in the field of automatic speech recognition.
The proposed DM-ASR framework innovatively reformulates the multi-speaker ASR task as a multi-turn dialogue generation process, effectively integrating speaker diarization cues into the transcription process. This approach decouples speaker identity and temporal information from linguistic content, allowing for a structured generation that enhances both transcription accuracy and robustness against imperfect diarization cues. The introduction of special tokens for speaker and timestamp information, alongside the optional word-level timestamp prediction, represents a significant methodological advancement in the field.
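A minimal sketch of how diarization output might be decomposed into speaker- and time-conditioned queries is shown below; the token format and prompt wording are illustrative assumptions rather than DM-ASR's actual schema.

```python
# Hypothetical sketch: one query per diarized speaker segment, conditioning the
# LLM on speaker identity and segment boundaries before asking for the transcript.
def build_queries(diarization):
    """diarization: list of (speaker_id, start_sec, end_sec) tuples, time-ordered."""
    queries = []
    for spk, start, end in diarization:
        queries.append(
            f"<spk:{spk}> <t:{start:.2f}-{end:.2f}> "
            f"Transcribe what speaker {spk} says in this segment."
        )
    return queries

for q in build_queries([("A", 0.00, 3.20), ("B", 2.90, 6.10), ("A", 6.10, 8.75)]):
    print(q)
```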
The experiments conducted on both Mandarin and English datasets demonstrate the effectiveness of DM-ASR, achieving competitive performance with smaller models and limited training data compared to larger, more data-intensive systems. The results indicate that the framework not only outperforms traditional cascaded systems but also rivals state-of-the-art end-to-end models, showcasing the practical applicability and generalizability of the proposed method across different languages and conversational contexts.
The paper provides detailed implementation information, including the architecture of the model, training procedures, and datasets used, which enhances reproducibility. However, the lack of publicly available code or demo URLs limits the ability for others to directly replicate the findings without additional effort.
One notable limitation is the reliance on external diarization systems, which can introduce errors that affect overall performance. Additionally, while the model shows robustness against imperfect cues, it does not consistently outperform strong diarization front-ends under all conditions, indicating a potential area for improvement. The paper also does not explore the scalability of the method to larger datasets or more complex conversational scenarios.
The DM-ASR framework has significant implications for real-world applications in multi-speaker environments such as meetings, interviews, and call centers. By improving the accuracy of speaker attribution and temporal grounding in ASR systems, it could enhance accessibility for users requiring accurate transcriptions, such as those with hearing impairments. Furthermore, the integration of LLMs with diarization cues could pave the way for more advanced conversational AI systems capable of understanding and generating human-like dialogue.
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
Primary: Klangio GmbH
All Institutions: Klangio GmbH, Institute of Industrial Information Technology, Karlsruhe Institute of Technology
This paper presents a novel transformer-based approach for beat-based rhythm quantization of MIDI performances, significantly advancing the field of Automatic Music Transcription. The integration of beat annotations into the quantization process enhances the model's performance and flexibility, marking a meaningful contribution to music information retrieval.
The methodology is robust, leveraging a transformer architecture tailored for rhythm quantization by incorporating beat annotations. The preprocessing steps for aligning performance and score data are well-defined, and the tokenization scheme is innovative, allowing for efficient encoding of musical data. The model's adaptability to different time signatures and its ability to generalize across unseen time signatures are significant contributions. However, the reliance on a priori beat information may limit its applicability in scenarios where such data is not available.
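A beat-based pre-quantization step of the kind described can be illustrated with a short sketch that maps performed onset times onto the annotated beat axis and snaps them to a sub-beat grid; the grid of 12 subdivisions per beat is an assumption for illustration, not the paper's setting.

```python
# Illustrative pre-quantization: performance seconds -> fractional beat
# positions (via the a-priori beat annotations) -> snapped sub-beat grid.
import numpy as np

def pre_quantize(onsets_sec, beat_times_sec, subdivisions=12):
    beats = np.arange(len(beat_times_sec), dtype=float)
    # Linear interpolation between annotated beats gives fractional beat positions.
    beat_pos = np.interp(onsets_sec, beat_times_sec, beats)
    return np.round(beat_pos * subdivisions) / subdivisions

beat_times = [0.0, 0.52, 1.01, 1.55, 2.04]      # annotated beat times in seconds
onsets = [0.05, 0.27, 0.78, 1.54]               # performed note onsets in seconds
print(pre_quantize(onsets, beat_times))          # onsets expressed in beats
```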
The experiments are comprehensive, utilizing a suitable dataset (ASAP) that includes diverse performance MIDI files. The evaluation metrics are well-chosen, focusing on onset F1-score and note value accuracy, which are critical for assessing quantization performance. The results demonstrate strong performance compared to state-of-the-art models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more extensive comparisons with a broader range of existing methods.
The paper provides sufficient details on the model architecture, training process, and evaluation metrics, which would allow other researchers to replicate the study. However, the absence of a publicly available code repository limits reproducibility.
The main limitations include the dependency on beat annotations, which may not always be available, and the model's performance on more complex time signatures that were not part of the training set. Additionally, the focus on piano and guitar data may restrict the model's generalizability to other instruments.
This work has significant implications for music information retrieval and automatic music transcription, offering a new approach to rhythm quantization that could enhance the usability of MIDI data in various applications, including music education, performance analysis, and music generation. The model's ability to generalize across different time signatures and instruments could lead to broader applications in music technology.
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of the ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system's ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.
Primary: Nanjing University
All Institutions: Nanjing University, Northwestern Polytechnical University, AISHELL
This paper presents a comprehensive study on full-duplex interaction in spoken dialogue systems, introducing a novel dataset and evaluation framework that significantly advance the field. The methodology is well-structured, and the results demonstrate the potential for developing more human-like dialogue systems, addressing key challenges in real-time conversational dynamics.
The paper introduces a dual-channel dataset that captures realistic conversational dynamics, including interruptions and overlapping speech, which is a significant advancement over existing datasets that primarily focus on single-channel recordings. The methodology for dataset construction combines LLM-generated scripts with human recordings, ensuring both authenticity and control over interaction behavior. The evaluation framework, HumDial-FDBench, is well-structured, providing clear metrics for assessing system performance in real-time dialogue scenarios. This comprehensive approach allows for a nuanced understanding of full-duplex interaction, making it a valuable resource for future research.
The experimental results are robust, with a clear comparison of various models' performance on the released benchmark. The paper provides detailed metrics for interruption handling, rejection behavior, and response latency, which are critical for evaluating the effectiveness of dialogue systems in real-world scenarios. The inclusion of a public leaderboard enhances the transparency and reproducibility of the results, encouraging further development in this area. However, the paper could benefit from more extensive discussion on the specific experimental setups and conditions under which the models were evaluated.
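One latency-style measurement of this kind can be sketched from dual-channel timestamps, pairing each user barge-in with the first subsequent stop of system speech; the event names and pairing rule below are illustrative, not HumDial-FDBench's actual protocol.

```python
# Hedged sketch: how quickly a system yields after a user barge-in, averaged
# over all barge-in events that are followed by a system-speech stop.
def mean_yield_latency(barge_in_times, system_stop_times):
    """Pair each user barge-in with the first system-speech stop that follows it."""
    latencies = []
    for t_barge in barge_in_times:
        later = [t for t in system_stop_times if t >= t_barge]
        if later:
            latencies.append(min(later) - t_barge)
    return sum(latencies) / len(latencies) if latencies else None

print(mean_yield_latency([3.2, 10.8], [1.0, 3.9, 11.1]))  # 0.5 s average
```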
The paper emphasizes the release of a publicly available dataset and benchmark, which facilitates reproducibility. The authors provide a clear methodology for data collection and evaluation metrics, allowing other researchers to replicate their experiments. However, the lack of detailed implementation specifics for the models evaluated may hinder full reproducibility for those attempting to build upon this work.
One limitation is the potential bias in the dataset construction, as it relies on scripted dialogues performed by professional actors, which may not fully capture the variability of spontaneous human interactions. Additionally, the paper acknowledges challenges related to background noise and speaker overlap, which could affect model performance in real-world applications. The evaluation metrics primarily focus on behavioral correctness and latency, potentially overlooking other important aspects of dialogue quality.
The resources provided by this research have significant implications for the development of more natural and responsive spoken dialogue systems. By addressing the limitations of traditional turn-taking paradigms, this work paves the way for advancements in human-computer interaction, with applications in customer service, virtual assistants, and conversational agents. The emphasis on real-time interaction and the ability to handle interruptions could lead to more engaging and effective communication tools.
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowledge, MAGIC-TTS is the first TTS model with explicit local timing control over token-level content duration and pause. MAGIC-TTS is enabled by explicit token-level duration conditioning, carefully prepared high-confidence duration supervision, and training mechanisms that correct zero-value bias and make the model robust to missing local controls. On our timing-control benchmark, MAGIC-TTS substantially improves token-level duration and pause following over spontaneous synthesis. Even when no timing control is provided, MAGIC-TTS maintains natural high-quality synthesis. We further evaluate practical local editing with a scenario-based benchmark covering navigation guidance, guided reading, and accessibility-oriented code reading. In this setting, MAGIC-TTS realizes a reproducible uniform-timing baseline and then moves the edited regions toward the requested local targets with low mean bias. These results show that explicit fine-grained controllability can be implemented effectively in a high-quality TTS system and can support realistic local timing-editing applications.
Primary: South China University of Technology
All Institutions: South China University of Technology
MAGIC-TTS introduces the first TTS model with explicit local timing control over token-level content duration and pause. This comprehensive analysis highlights the model's innovative approach to TTS, its rigorous methodology, and its potential to significantly impact the field of speech synthesis by improving the quality and controllability of generated speech.
The methodology presented in MAGIC-TTS is robust, leveraging a flow-based TTS backbone to achieve explicit local timing control over token-level content duration and pause. The authors introduce a novel training mechanism that incorporates high-confidence duration supervision and zero-value correction, which effectively addresses the challenges of local timing manipulation in TTS systems. The separation of timing control from the acoustic generation process is a significant improvement, allowing for precise control without compromising synthesis quality. The detailed explanation of the training data pipeline and the careful construction of timing supervision demonstrate a thorough understanding of the complexities involved in TTS systems.
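A minimal sketch of duration conditioning with an explicit "specified" mask, in the spirit of the zero-value correction described, is shown below; encoding unspecified durations through a mask channel rather than as literal zeros avoids the model reading "no control" as "very short". The tensor layout is an assumption, not MAGIC-TTS's implementation.

```python
# Hypothetical sketch: per-token duration values plus a mask channel marking
# which tokens actually carry a user-specified local timing control.
import torch

def build_duration_condition(durations_sec):
    """durations_sec: list with a float per token, or None where uncontrolled."""
    values = torch.tensor([d if d is not None else 0.0 for d in durations_sec])
    mask = torch.tensor([1.0 if d is not None else 0.0 for d in durations_sec])
    return torch.stack([values, mask], dim=-1)   # (num_tokens, 2)

# Control only the pause token (index 2) and one word token (index 4).
print(build_duration_condition([None, None, 0.6, None, 0.25]))
```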
The experiments are well-designed, utilizing a comprehensive timing-control benchmark to validate the effectiveness of MAGIC-TTS. The results show substantial improvements in token-level duration and pause accuracy when explicit controls are provided, with clear metrics such as mean absolute error and correlation coefficients. The ablation studies further strengthen the claims by isolating the contributions of key components, confirming the importance of zero-value correction and cross-validated timing supervision. The practical local editing scenarios also illustrate the model's versatility and real-world applicability.
The paper provides sufficient details regarding the experimental setup, including model architecture, training configurations, and evaluation protocols, which supports reproducibility. However, the absence of a publicly available demo or project URL limits the practical reproducibility of the results, as external researchers would need to replicate the entire setup from scratch.
One limitation is the reliance on high-confidence supervision, which may not be easily attainable in all datasets or languages, potentially affecting the model's generalizability. Additionally, while the paper demonstrates improvements in timing control, it does not extensively explore the impact of these improvements on user experience or subjective quality assessments in real-world applications.
The advancements in fine-grained controllability in TTS systems have significant implications for applications such as navigation guidance, accessibility tools, and interactive voice assistants. By enabling precise local timing manipulation, MAGIC-TTS can enhance the expressiveness and naturalness of synthesized speech, making it more adaptable to various contexts and user needs.
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer input and the instrument's mechanically imposed, time-varying load. Reflective optical sensors mounted beneath the distal end of each lever provide continuous displacement, timing, and articulation data without interfering with the action. Unlike existing optical systems designed for modern pianos, PHOTON accommodates the diverse geometries, limited clearances, and non-standard layouts of harpsichords, clavichords, and early fortepianos. Its modular, low-profile architecture enables high-resolution, low-latency sensing across multiple manuals and variable key counts. Beyond performance capture, PHOTON provides real-time MIDI output and supports empirical study of expressive gesture, human-instrument interaction, and the construction of instrument-specific MIDI corpora using real historical mechanisms. The complete system is released as open-source hardware and software, from schematics and PCB layouts developed in KiCad to firmware written in CircuitPython, lowering the barrier to adoption, replication, and extension.
Primary: Institute for Logic, Language, and Computation
All Institutions: Institute for Logic, Language, and Computation, University of Amsterdam
The main contribution of this paper is the introduction of the PHOTON system, a non-invasive optical tracking technology for historical keyboard instruments that facilitates detailed analysis of key-lever motion and expressive gesture. This innovative approach, combined with its open-source nature, positions PHOTON as a valuable tool for researchers and performers alike, potentially transforming the study and practice of historical keyboard music.
The methodology presented in this paper is innovative and well-structured, focusing on a non-invasive optical sensing system tailored for historical keyboard instruments. The use of reflective optical sensors to measure key-lever motion is a significant advancement over existing systems, which are primarily designed for modern pianos. The modular and low-profile design allows for high-resolution data capture while accommodating the unique geometries of historical instruments. The authors provide a thorough explanation of the hardware design, including sensor selection, calibration, and integration, which demonstrates a strong understanding of the mechanical constraints involved. The open-source nature of the project enhances its accessibility and encourages further research and development.
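As a hedged, host-side illustration (distinct from the released CircuitPython firmware), the sketch below converts a sampled key-lever displacement trace into note-on events with a crude velocity estimate from descent speed; the thresholds and velocity scaling are assumptions for illustration.

```python
# Illustrative host-side processing: detect key-on threshold crossings in a
# normalized displacement trace and derive a rough velocity from how quickly
# the key descends after the previous key-off.
def displacement_to_onsets(trace, sample_rate_hz, on_threshold=0.6, off_threshold=0.2):
    """trace: displacement samples normalized to [0, 1] (1 = fully depressed)."""
    events, pressed, last_idx = [], False, 0
    for i, x in enumerate(trace):
        if not pressed and x >= on_threshold:
            # Velocity proxy: threshold displacement divided by time since last key-off.
            dt = max((i - last_idx) / sample_rate_hz, 1e-6)
            velocity = min(127, int((on_threshold / dt) * 8))
            events.append({"t": i / sample_rate_hz, "velocity": velocity})
            pressed = True
        elif pressed and x <= off_threshold:
            pressed, last_idx = False, i
    return events

trace = [0.0, 0.1, 0.4, 0.7, 0.9, 0.8, 0.3, 0.1, 0.0]
print(displacement_to_onsets(trace, sample_rate_hz=1000))
```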
While the paper does not present extensive experimental results, it includes a case study that illustrates the effectiveness of the PHOTON system in capturing key-action behavior on a harpsichord. The authors provide motion traces that reveal fine-grained aspects of touch and articulation, which are crucial for understanding performance nuances. However, more comprehensive experiments comparing PHOTON with existing systems or evaluating its performance across various historical instruments would strengthen the paper's contributions.
The authors emphasize reproducibility by providing detailed schematics, PCB layouts, and firmware source code. The use of widely available components and open-source tools further supports the project's replicability. The inclusion of a custom KiCad plugin for sensor placement is particularly noteworthy, as it simplifies the adaptation of the system to different keyboard layouts.
One limitation of the study is the lack of extensive empirical validation across a broader range of historical keyboard instruments. While the case study is informative, additional data from various setups would provide a more robust evaluation of the system's capabilities. Furthermore, ethical considerations regarding unobtrusive sensing are briefly mentioned but could benefit from a more in-depth discussion.
The PHOTON system has the potential to significantly impact the fields of musicology, performance practice, and instrument design. By enabling detailed empirical studies of expressive gesture and human-instrument interaction, it opens new avenues for research that have been historically underrepresented. The integration of real-time MIDI output and the ability to create instrument-specific MIDI corpora can enhance both educational and performance contexts, making historical keyboard instruments more accessible to contemporary musicians.
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic gradient of the portamento slide, measured in Hz/second, and demonstrates its measurement using a protocol combining Sonic Visualizer's melodic spectrogram layer, GIMP pixel analysis, and metric calibration against the spectrogram's known frequency axis. The gradient captures what duration alone cannot: the steepness of the pitch trajectory, which encodes the expressive character of the slide independently of its length. Applied to the opening measures of two cello sonatas, chosen specifically because their monophonic texture permits reliable spectrographic pitch tracking, the method yields gradient values ranging from approximately 600 Hz/s in late-period recordings to over 4,000 Hz/s in early twentieth-century performances. The paper further documents a gain-recovery protocol that extends the analysable corpus to analogue recordings from the 1930s where portamento traces are faint in digital transfer. Applying the method to a corpus of 22 recordings spanning 1930–2012, the paper tests the hypothesis that gradient steepness correlates negatively with tempo: slower performances produce steeper, longer slides while faster performances produce shallower slides or none at all. The results support this hypothesis, suggesting that the widely documented decline of portamento across the twentieth century is not a binary transition from presence to absence but a continuous attenuation of slide steepness.
Primary: unknown
All Institutions: unknown
This paper introduces a new quantitative descriptor for portamento in string performance, significantly enhancing the analysis of expressive techniques in historical recordings. The innovative methodology and empirical findings provide valuable insights into the evolution of musical expression, making a meaningful contribution to the fields of musicology and audio analysis.
The paper introduces a novel methodology for measuring portamento in string performance through a spectrographic gradient, which is a significant advancement over existing binary measures of portamento presence and duration. The combination of Sonic Visualizer for spectrogram analysis and GIMP for pixel analysis is innovative, allowing for a more nuanced understanding of musical expressiveness. The calibration of the gradient measurement to physical units (Hz/second) adds rigor and comparability to the findings.
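The pixel-to-physical-units calibration can be illustrated with a short worked computation: the slide's rise and run are measured in pixels and converted with calibration constants read off the spectrogram's frequency and time axes. The numbers below are illustrative, not values from the paper.

```python
# Worked sketch of the calibration: pixel rise/run -> Hz per second.
def portamento_gradient(delta_y_px, delta_x_px, hz_per_px, sec_per_px):
    """Return slide steepness in Hz per second."""
    return (delta_y_px * hz_per_px) / (delta_x_px * sec_per_px)

# e.g. a slide spanning 80 px vertically and 120 px horizontally, with the
# axes calibrated at 2.5 Hz/px and 0.004 s/px:
print(portamento_gradient(80, 120, hz_per_px=2.5, sec_per_px=0.004))  # ~416.7 Hz/s
```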
The experiments are well-structured, utilizing a corpus of 22 recordings spanning over eight decades. The analysis of gradient values and their correlation with tempo provides empirical support for the paper's hypotheses. The use of historical recordings adds depth to the findings, showing a continuous decline in portamento expressiveness rather than a simple absence.
The methodology is detailed, with clear steps for measurement and calibration, which should allow for reproducibility by other researchers. However, the reliance on human judgment in placing reference points for gradient measurement introduces variability that could affect reproducibility.
The study is limited to specific passages of two sonatas, which may not generalize across the entire cello repertoire. Additionally, the subjective nature of reference point placement could lead to inconsistencies in gradient measurement. The calibration constants are also specific to the settings used, which may limit comparisons with other studies.
This research has the potential to influence both musicology and performance practice by providing a quantitative framework for analyzing expressive techniques in string performance. The findings could inform teaching practices and performance interpretations, as well as contribute to the broader understanding of stylistic evolution in music.
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff notation and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline, grounded in canonical pitch projection, to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, China Conservatory of Music, Nanyang Technological University
The paper introduces ONOTE, a comprehensive benchmark for evaluating Omnimodal Notation Processing, which addresses critical gaps in the assessment of music intelligence systems. The methodology and results presented are significant contributions to the field, paving the way for more effective and interpretable models in music AI.
The proposed ONOTE benchmark introduces a structured and deterministic evaluation framework for Omnimodal Notation Processing (ONP), addressing the limitations of existing models that often rely on subjective evaluations. The methodology effectively integrates multiple notation systems and tasks, ensuring a comprehensive assessment of model capabilities across auditory, visual, and symbolic domains. The use of canonical pitch projection and sequence alignment to eliminate biases is particularly innovative, allowing for a more rigorous comparison of model performance.
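A hedged sketch of a canonical-pitch-projection style of deterministic scoring is given below: note names are projected onto MIDI pitch numbers and compared with an edit-distance similarity instead of an LLM judge. The note-name parsing and similarity definition are assumptions, not ONOTE's exact pipeline.

```python
# Hypothetical sketch: project note names onto canonical MIDI numbers, then
# score prediction vs. reference with a deterministic edit-distance similarity.
STEP = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def to_midi(note):
    """'C#4' -> 61; naive parser for letter, optional #/b, octave."""
    letter, rest = note[0].upper(), note[1:]
    alter = rest.count("#") - rest.count("b")
    octave = int(rest.lstrip("#b"))
    return 12 * (octave + 1) + STEP[letter] + alter

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def pitch_similarity(pred_notes, ref_notes):
    pred, ref = [to_midi(n) for n in pred_notes], [to_midi(n) for n in ref_notes]
    return 1.0 - edit_distance(pred, ref) / max(len(ref), 1)

print(pitch_similarity(["C4", "E4", "G4", "C5"], ["C4", "Eb4", "G4", "C5"]))  # 0.75
```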
The experiments conducted on leading omnimodal models reveal significant insights into their performance across various tasks, including Visual Score Understanding (VSU), Cross-Format Notation Conversion (CNC), Audio-to-Symbolic Transcription (AST), and Symbolic Music Generation (SMG). The results highlight a clear disconnect between perceptual accuracy and music-theoretic comprehension, underscoring the benchmark's effectiveness in diagnosing reasoning vulnerabilities. The dataset construction and evaluation metrics are well-defined, providing a robust foundation for future research.
The paper provides detailed implementation details and a clear methodology for constructing the ONOTE benchmark, which enhances reproducibility. The availability of the dataset and code on GitHub further supports the reproducibility of the results, allowing other researchers to validate and build upon the work.
While the benchmark addresses several critical issues in music notation processing, it may still be limited by the inherent biases present in the datasets used for training and evaluation. Additionally, the focus on specific notation systems may not fully encompass the diversity of global musical representations, potentially limiting the generalizability of the findings.
The ONOTE benchmark has the potential to significantly influence the field of music intelligence by providing a standardized evaluation framework that encourages the development of more robust and interpretable omnimodal systems. Its implications extend beyond academic research, potentially impacting music education, automated composition, and music analysis tools.
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in conference rooms, lecture halls, and interviews. We present EchoMask, the first practical physical-layer system for real-time voiceprint anonymization using acoustic metamaterials. By modifying sound waves before they reach the microphone, EchoMask prevents attackers from capturing clean voiceprints through compromised devices. Our design combines three key innovations: frequency-selective interference to disrupt voiceprint features while preserving speech intelligibility, an acoustic-field model to ensure stability under speaker movement, and reconfigurable structures that create time-varying interference to prevent learning or canceling a fixed acoustic pattern. EchoMask is low-cost, power-free, and 3D-printable, requiring no machine learning, software support, or microphone modification. Experiments conducted across eight microphones in diverse environments demonstrate that EchoMask increases the Miss-match Rate, i.e., the fraction of failed voiceprint matching attempts, to over 90%, while maintaining high speech intelligibility.
Primary: Northwest University
All Institutions: Northwest University, University of Leeds
This paper presents a pioneering approach to voiceprint anonymization using acoustic metamaterials, addressing critical challenges in real-time applications while maintaining speech intelligibility. The combination of innovative design principles and thorough experimental validation positions this work as a significant contribution to the field of audio privacy and security.
The methodology presented in this paper is innovative, leveraging acoustic metamaterials for voiceprint anonymization in real-time scenarios. The authors effectively address three critical challenges: maintaining speech intelligibility while disrupting identity cues, ensuring stability under speaker movement, and preventing predictable acoustic patterns. The design principles are well-structured, focusing on targeted low-frequency perturbation, dynamic stability, and passive randomization, which collectively enhance the robustness of the system. The use of numerical simulations and physical experimentation to validate the design is commendable, although the lack of machine learning integration may limit adaptability in some contexts.
The experiments are comprehensive, evaluating the system across various microphones and real-world conditions. The results demonstrate a high Miss-match Rate (MMR) of over 90%, indicating effective voiceprint protection while maintaining speech intelligibility. The inclusion of subjective listening tests (Mean Opinion Score) further strengthens the evaluation by providing insights into perceived audio quality. However, the paper could benefit from a more detailed breakdown of the experimental setup and conditions to enhance transparency.
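As a point of reference, the Miss-match Rate reported above is simply the fraction of voiceprint matching attempts that fail; a minimal sketch, assuming a threshold-based speaker-verification back end with hypothetical similarity scores:

    def mismatch_rate(scores, threshold):
        """Miss-match Rate: fraction of matching attempts whose similarity
        falls below the verifier's acceptance threshold."""
        fails = sum(1 for s in scores if s < threshold)
        return fails / len(scores)

    # hypothetical similarity scores produced by a compromised microphone's verifier
    print(mismatch_rate([0.12, 0.31, 0.08, 0.55, 0.20], threshold=0.5))  # 0.8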
While the paper provides a solid theoretical foundation and experimental results, it lacks specific implementation details that would facilitate reproducibility. Key parameters, such as the exact configurations of the metamaterials and the experimental setups, are not thoroughly documented. Additionally, the absence of a project URL or code repository limits the ability of other researchers to replicate the work.
The primary limitations include the reliance on passive metamaterials, which may restrict adaptability to varying acoustic environments and speaker dynamics. The system's performance under extreme conditions (e.g., very high noise levels or rapid speaker movement) is not fully explored. Furthermore, while the approach is innovative, it does not incorporate machine learning techniques that could enhance performance through adaptive learning.
The implications of this research are significant, particularly in enhancing privacy and security in voice-based authentication systems. The ability to anonymize voiceprints in real-time without requiring modifications to existing devices opens up new avenues for protecting users in public and shared environments. The findings could influence future designs of microphones and voice interaction systems, promoting user privacy in increasingly digital and interconnected spaces.
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
Primary: Renmin University of China
All Institutions: Renmin University of China
The paper presents a novel approach to audio-text interleaved contextual retrieval, introducing the ATIR task and a benchmark that significantly enhances the capabilities of existing retrieval systems. The comprehensive methodology, innovative technical contributions, and thorough experimental validation position this work as a meaningful advancement in the field of multimodal information retrieval.
The methodology presented in the paper is robust, introducing the ATIR task and a comprehensive benchmark that addresses the limitations of existing audio retrieval datasets. The novel token compression mechanism and the bi-encoder architecture with a token selector module are innovative contributions that enhance the performance of interleaved audio-text retrieval. The synthesis pipeline for data generation is well-structured, ensuring high-quality multimodal data that is critical for training effective models.
The experimental evaluation is thorough, demonstrating significant improvements over strong baselines across various retrieval settings. The use of multiple metrics (Recall@k and nDCG@k) provides a comprehensive assessment of model performance. The ablation studies effectively validate the contributions of the proposed components, particularly the token selector's impact on retrieval efficiency and accuracy.
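For readers unfamiliar with the reported metrics, a small self-contained sketch of Recall@k and binary-relevance nDCG@k as they are conventionally computed; the document IDs below are placeholders.

    import math

    def recall_at_k(ranked_ids, relevant_ids, k):
        """Fraction of relevant documents that appear in the top-k results."""
        hits = len(set(ranked_ids[:k]) & set(relevant_ids))
        return hits / len(relevant_ids)

    def ndcg_at_k(ranked_ids, relevant_ids, k):
        """Binary-relevance nDCG@k."""
        dcg = sum(1.0 / math.log2(i + 2)
                  for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
        ideal = sum(1.0 / math.log2(i + 2)
                    for i in range(min(k, len(relevant_ids))))
        return dcg / ideal if ideal > 0 else 0.0

    ranked = ["d3", "d7", "d1", "d9", "d2"]   # hypothetical retriever output
    relevant = {"d1", "d2"}
    print(recall_at_k(ranked, relevant, 5), ndcg_at_k(ranked, relevant, 5))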
The paper provides detailed implementation information, including model architecture, training configurations, and hyperparameters, which supports reproducibility. However, the lack of a publicly available project or demo URL limits accessibility for other researchers wishing to replicate the results.
The paper acknowledges limitations, such as the focus on single-document retrieval and the potential for future exploration of more complex retrieval scenarios. Additionally, the lightweight representation design may restrict performance in certain contexts, and the evaluation is primarily centered on QA-centric tasks, leaving broader applications untested.
The introduction of the ATIR task and benchmark has the potential to significantly influence multimodal retrieval research, particularly in applications involving conversational agents and hybrid search systems. The findings could lead to advancements in how audio and text are integrated for more effective information retrieval systems.
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.
Primary: FindLab
All Institutions: FindLab
The paper introduces a novel two-stage OMR approach that effectively decodes complex polyphonic music into structured formats, significantly advancing the field of music recognition. The methodology leverages innovative techniques to address longstanding challenges in music transcription, with implications for both practical applications and future research directions.
The paper presents a two-stage Optical Music Recognition (OMR) pipeline that innovatively formulates the second stage as a structure decoding problem. The use of topology recognition with a probability-guided search (BeadSolver) is a significant methodological advancement, addressing the complex challenges of voice separation and timing in polyphonic music. The integration of procedural generation with recognition-feedback annotations for training data further enhances the robustness of the proposed method.
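BeadSolver itself is not reproduced here, but the general pattern of probability-guided search it builds on can be sketched as a best-first search over partial structures scored by log-probability; every name below is hypothetical, and the scoring, expansion, and completion tests stand in for the paper's domain-specific components such as voice assignments and intra-measure timing.

    import heapq
    import math

    def best_first_decode(candidates, score, is_complete, expand):
        """Generic probability-guided search over partial structures.
        `candidates`: initial partial structures; `score(state)`: log-probability;
        `expand(state)`: yields successor structures; `is_complete(state)`: goal test."""
        heap = [(-score(c), i, c) for i, c in enumerate(candidates)]
        heapq.heapify(heap)
        counter = len(candidates)  # tie-breaker so states never need comparing
        while heap:
            neg_logp, _, state = heapq.heappop(heap)
            if is_complete(state):
                return state, -neg_logp
            for nxt in expand(state):
                counter += 1
                heapq.heappush(heap, (-score(nxt), counter, nxt))
        return None, -math.inf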
The experiments are well-structured, comparing the proposed BeadSolver against rule-based and linear-equations baselines. The results demonstrate clear improvements in the quality of the structured output, indicating that the proposed method effectively addresses the limitations of existing approaches. However, specific quantitative results and metrics used for evaluation could be more explicitly detailed to strengthen the findings.
The paper outlines the methodology and provides a clear description of the data pipeline and model architecture, which aids in reproducibility. However, the absence of publicly available code or datasets limits the ability to fully replicate the results.
The paper does not address potential limitations in handling highly variable music notations or the scalability of the proposed method to broader music genres beyond piano scores. Additionally, the reliance on procedural generation for training data may introduce biases that are not fully explored.
The proposed OMR system has the potential to significantly enhance the accessibility of historical and contemporary music scores, enabling better integration into digital music platforms and educational tools. This could foster greater engagement with music education and preservation efforts.
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate poorly with perceptual audio quality ratings from listening tests, which are considered the gold-standard evaluation method. As an alternative, singing voice separation research has introduced embedding-based intrusive metrics that leverage latent representations from large self-supervised audio models such as Music undERstanding with large-scale self-supervised Training (MERT). In this work, we analyze the correlation of perceptual audio quality ratings with two intrusive embedding-based metrics: a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD), both calculated on MERT embeddings. Experiments on two independent datasets show that these metrics correlate more strongly with perceptual audio quality ratings than traditional BSS-Eval metrics across all analyzed stem and model types.
Primary: University of Music and Performing Arts Graz
All Institutions: University of Music and Performing Arts Graz
The main contribution of this paper is the introduction of embedding-based intrusive evaluation metrics for musical source separation, which demonstrate stronger correlations with perceptual audio quality ratings than traditional BSS-Eval metrics. This work significantly advances the evaluation methodologies in the field, providing a more perceptually relevant framework for assessing audio separation models.
The paper introduces a novel approach to evaluate musical source separation (MSS) using embedding-based intrusive metrics derived from MERT representations. The methodology is well-structured, leveraging self-supervised audio models to compute metrics that correlate better with human perceptual ratings compared to traditional BSS-Eval metrics. The use of two specific metrics (MSE and an intrusive variant of FAD) is innovative, and the paper provides a clear explanation of how these metrics are calculated and their significance in the context of MSS evaluation.
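A minimal sketch of the two metric families described, assuming pre-computed MERT embeddings as (frames x dims) arrays; layer selection, aggregation, and any per-stem handling follow the paper and are not reproduced here.

    import numpy as np
    from scipy import linalg

    def embedding_mse(emb_est, emb_ref):
        """Mean squared error between time-aligned embedding sequences."""
        return float(np.mean((emb_est - emb_ref) ** 2))

    def frechet_distance(emb_est, emb_ref, eps=1e-6):
        """Fréchet distance between Gaussians fitted to estimate and reference embeddings."""
        mu1, mu2 = emb_est.mean(axis=0), emb_ref.mean(axis=0)
        s1 = np.cov(emb_est, rowvar=False) + eps * np.eye(emb_est.shape[1])
        s2 = np.cov(emb_ref, rowvar=False) + eps * np.eye(emb_ref.shape[1])
        covmean = linalg.sqrtm(s1 @ s2)
        covmean = covmean.real if np.iscomplexobj(covmean) else covmean
        diff = mu1 - mu2
        return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

The intrusive aspect is that both metrics compare the separated estimate against its reference embedding, unlike the reference-free FAD commonly used for generative audio.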
The experiments are robust, utilizing two independent datasets (Bake-Off and GenSVS) to validate the proposed metrics. The correlation analysis conducted using Spearman's rank correlation coefficient (SRCC) and Pearson's correlation coefficient (PCC) is appropriate and effectively demonstrates the superiority of the embedding-based metrics over traditional methods. The results are well-presented, with clear tables and figures that summarize the findings.
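The correlation analysis itself reduces to standard SciPy calls once per-item metric values and listening-test ratings are available; the values below are placeholders.

    from scipy.stats import spearmanr, pearsonr

    metric_scores = [0.21, 0.34, 0.08, 0.55, 0.47]   # hypothetical per-item metric values
    mos_ratings   = [3.1, 3.8, 2.4, 4.5, 4.1]        # hypothetical listening-test ratings

    srcc, _ = spearmanr(metric_scores, mos_ratings)
    pcc, _ = pearsonr(metric_scores, mos_ratings)
    print(f"SRCC={srcc:.3f}  PCC={pcc:.3f}")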
The paper provides sufficient detail about the datasets and the implementation of the metrics, including references to the Python packages used. However, the absence of direct access to the datasets limits full reproducibility for external researchers. The code repository linked in the paper enhances reproducibility for the proposed metrics and analyses.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of musical sources encountered in real-world applications. Additionally, while the proposed metrics show improved correlation with perceptual ratings, the paper does not explore their performance across a broader range of audio genres or separation tasks.
The findings have significant implications for the field of audio processing and music technology, as they suggest a more reliable evaluation framework for MSS models. This could lead to improved development and assessment of audio separation technologies, benefiting applications in music production, audio restoration, and content creation. The approach could also inspire further research into embedding-based evaluation metrics in other audio-related tasks.
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task-relevant semantics rather than reconstructing signals in real time. This fundamental difference shifts the transport goal from the technical problem of signal fidelity (Shannon-Weaver Level A) to the semantic problem of meaning preservation (Level B). This mismatch imposes significant overhead. In visual pipelines, screenshot upload accounts for over 60% of end-to-end action latency on constrained uplinks, and in voice pipelines, conventional transport carries massive redundancy, sending 43-64x more data than needed to maintain task accuracy. We present Sema, a semantic transport system that combines discrete audio tokenizers with a hybrid screen representation (lossless accessibility-tree or OCR text, plus compact visual tokens) and bursty token delivery that eliminates jitter buffers. In simulations under emulated WAN conditions, Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while preserving task accuracy within 0.7 percentage points of the raw baseline.
Primary: Unaffiliated
All Institutions: Unaffiliated, Pine AI
The paper presents Sema, a semantic transport system that significantly reduces bandwidth requirements for real-time multimodal agents while maintaining task accuracy. The innovative approach and strong experimental results position this work as a meaningful contribution to the field of machine learning, particularly in audio and multimodal communication contexts.
The methodology presented in the paper introduces a novel semantic transport system, Sema, which shifts the focus from traditional signal fidelity to semantic meaning preservation. The authors effectively combine discrete audio tokenization with a hybrid screen representation, optimizing for real-time multimodal agent communication. The approach is well-structured, leveraging existing technologies in a new context, and the design principles are clearly articulated. However, the paper could benefit from a more detailed exploration of the implementation specifics and potential integration challenges with existing systems.
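A back-of-envelope illustration of why token-level transport shrinks the uplink so dramatically; all rates below are placeholder assumptions, not figures from the paper, and framing or retransmission overhead is ignored.

    # Back-of-envelope comparison of raw-audio vs. token transport.
    # Every rate here is an illustrative assumption, not a number from the paper.
    sample_rate_hz = 16_000
    bits_per_sample = 16
    raw_bps = sample_rate_hz * bits_per_sample          # 256 kbit/s mono PCM

    tokens_per_second = 50       # assumed discrete-codec frame rate
    bits_per_token = 10          # assumed 1024-entry codebook
    token_bps = tokens_per_second * bits_per_token      # 0.5 kbit/s

    print(f"raw:   {raw_bps / 1e3:.0f} kbit/s")
    print(f"token: {token_bps / 1e3:.1f} kbit/s  (~{raw_bps / token_bps:.0f}x smaller)")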
The experimental evaluation is robust, utilizing simulations under emulated WAN conditions to demonstrate significant reductions in uplink bandwidth for both audio and screenshots while maintaining task accuracy. The results are compelling, showcasing the effectiveness of the proposed system in practical scenarios. However, the reliance on simulation rather than real-world testing limits the generalizability of the findings.
The paper lacks sufficient implementation details that would facilitate reproducibility. While the authors describe their methods and evaluations, the absence of a publicly available codebase or detailed algorithmic descriptions hinders other researchers from replicating the study.
The primary limitations include the lack of real-world testing, which raises questions about the performance of the system in diverse network conditions. Additionally, the paper does not address potential challenges in integrating the proposed system with existing multimodal agent architectures, which could affect its adoption.
The implications of this work are significant, as it addresses a critical bottleneck in multimodal agent communication by optimizing data transport for AI models rather than human users. This could lead to more efficient and responsive AI systems, enhancing applications in various domains such as virtual assistants, gaming, and remote collaboration tools.
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
Primary: New York University
All Institutions: South China University of Technology, National University of Singapore, New York University
The main contribution of this paper is the introduction of BEAT, a novel tokenization framework for symbolic music generation that enhances the coherence and quality of generated music while facilitating real-time accompaniment. This work significantly advances the field by integrating structured tokenization with autoregressive modeling, addressing key challenges in music generation and representation.
The proposed BEAT tokenization method introduces a novel approach to symbolic music representation by utilizing uniform temporal steps, which addresses the limitations of existing event-based and notation-based methods. The authors effectively leverage the concept of beats as fundamental units, allowing for compact representation while maintaining temporal regularity. The methodology is well-structured, detailing the encoding process, beat-level assembly, and sequence construction, which collectively enhance the model's ability to generate coherent musical outputs. The integration of a Transformer model with this tokenization is a significant advancement, as it facilitates real-time generation and accommodates various music generation tasks.
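A conceptual sketch of the uniform-step grouping described above, assuming note events given as (onset, duration, pitch) in beats; BEAT's actual vocabulary and beat-level assembly rules are more detailed than this toy version.

    from collections import defaultdict

    def beat_grouped_tokens(notes, steps_per_beat=1):
        """Group note events into uniform time steps (sparse piano-roll style).
        `notes`: list of (onset_in_beats, duration_in_beats, midi_pitch).
        Returns, per step, one (pitch, duration_in_steps) token per note starting there."""
        grid = defaultdict(list)
        for onset, dur, pitch in notes:
            step = int(onset * steps_per_beat)
            dur_steps = max(1, round(dur * steps_per_beat))
            grid[step].append((pitch, dur_steps))
        n_steps = max(grid) + 1 if grid else 0
        # one token group per step; empty steps keep time progressing uniformly
        return [sorted(grid[s]) for s in range(n_steps)]

    notes = [(0.0, 1.0, 60), (0.0, 1.0, 64), (1.0, 0.5, 67), (3.0, 1.0, 72)]
    for step, toks in enumerate(beat_grouped_tokens(notes)):
        print(step, toks)

The key property the sketch preserves is that every group advances time by exactly one step, so the model never has to infer how much time a token consumes.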
The experiments conducted are comprehensive, involving both objective and subjective evaluations across multiple music generation tasks, including piano and multi-track continuation. The use of established metrics such as Groove Consistency, Scale Consistency, and Fréchet Music Distance provides a robust framework for assessing the performance of the BEAT method against baseline models. The subjective evaluations, which include listener surveys, further validate the effectiveness of BEAT in producing high-quality musical outputs. The results consistently demonstrate that BEAT outperforms existing methods, indicating its practical applicability in real-world scenarios.
The paper provides sufficient implementation details, including model architecture, training datasets, and evaluation protocols, which enhance the reproducibility of the results. However, the absence of a public code repository limits the ease with which other researchers can replicate the findings. The authors could improve reproducibility by sharing their code and datasets, allowing for independent verification of their results.
While the BEAT method shows promise, there are limitations regarding the diversity of the training datasets, which primarily reflect Western musical traditions. This cultural bias may restrict the model's applicability to other musical styles. Additionally, the reliance on subjective evaluations, while valuable, introduces variability based on listener preferences, which may not universally represent the quality of generated music.
The development of BEAT has the potential to significantly impact the field of generative music AI, enhancing artistic expression and creativity. By providing a structured framework for music generation, it can assist musicians and learners in exploring new creative avenues. However, the potential for over-reliance on automated systems raises concerns about the erosion of fundamental musical skills. Furthermore, the focus on Western music could lead to a homogenization of musical styles, underscoring the need for diverse datasets in future research.
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this aim, we introduce the first large-scale benchmark dataset, consisting of manually annotated 10,093 Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.
Primary: Rutgers University
All Institutions: Rutgers University, Gachon University, Hanyang Institute for Phonetics and Cognitive Sciences of Language (HIPCS)
The main contribution of this paper is the introduction of the Dual-Glob framework for pitch accent classification in Seoul Korean, which leverages deep supervised contrastive learning to effectively capture the structural features of continuous $F_0$ contours. This work not only presents a novel methodology but also establishes a valuable benchmark dataset, paving the way for future research in the intersection of linguistics and machine learning.
The proposed Dual-Glob framework employs deep supervised contrastive learning to enhance pitch accent classification by focusing on the holistic representation of $F_0$ contours. This approach is innovative as it contrasts with traditional local predictive models by enforcing structural consistency between clean and augmented views, which is crucial for capturing the nuances of pitch accents in Seoul Korean. The introduction of a large-scale benchmark dataset with 10,093 manually annotated Accentual Phrases is a significant methodological advancement, providing a solid foundation for the proposed learning framework.
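A minimal sketch of a supervised contrastive objective over clean and augmented views, which is the general mechanism the review describes; Dual-Glob's exact loss, projection heads, and augmentation pipeline may differ.

    import torch
    import torch.nn.functional as F

    def supcon_loss(z_clean, z_aug, labels, temperature=0.1):
        """Supervised contrastive loss over two views of each F0 contour.
        z_clean, z_aug: (N, D) embeddings of the clean / augmented views;
        labels: (N,) pitch-accent class ids."""
        z = F.normalize(torch.cat([z_clean, z_aug], dim=0), dim=1)   # (2N, D)
        y = torch.cat([labels, labels], dim=0)                        # (2N,)
        sim = z @ z.t() / temperature                                 # (2N, 2N)
        n = sim.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
        # log-softmax over all other samples, then average over the positives
        sim = sim.masked_fill(self_mask, float("-inf"))
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_count = pos_mask.sum(dim=1).clamp(min=1)
        loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
        return loss.mean()

Because each contour always has its own augmented view with the same label, every anchor has at least one positive, which is what enforces structural consistency between the two views in the shared latent space.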
The experimental results demonstrate that the Dual-Glob framework achieves state-of-the-art performance with an accuracy of 77.75% and an F1-score of 51.54%. The paper effectively compares its results against strong baseline models, showcasing the robustness of the proposed method. However, the paper could benefit from more detailed discussions on the experimental setup, including data splits, training protocols, and hyperparameter tuning, to allow for better reproducibility and understanding of the results.
The paper lacks sufficient details regarding the implementation of the Dual-Glob framework, such as specific model architectures, training procedures, and evaluation metrics. This omission may hinder reproducibility. Including a supplementary material section with code or detailed configuration settings would significantly enhance the paper's reproducibility.
One limitation of the study is the focus on a specific language (Seoul Korean), which may limit the generalizability of the findings to other languages or dialects. Additionally, while the proposed method shows improved performance, the F1-score indicates that there may still be challenges in accurately classifying certain pitch accent patterns, suggesting room for further refinement.
The findings of this research have the potential to advance the understanding of intonational phonology and improve speech recognition systems for Seoul Korean. By leveraging deep learning techniques, this work could contribute to more robust language processing tools, which may also be applicable to other tonal languages. Furthermore, the introduction of a benchmark dataset can foster further research in this area, encouraging the development of more sophisticated models for pitch accent classification.
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
Primary: NVIDIA
All Institutions: NVIDIA
The paper presents a novel Unified ASR framework that effectively bridges the performance gap between offline and streaming automatic speech recognition systems. This work is significant for its methodological innovations, comprehensive experimental validation, and potential impact on the deployment of efficient ASR solutions in various applications.
The paper introduces a Unified ASR framework for RNNT that effectively combines offline and streaming capabilities within a single model. The use of chunk-limited attention and dynamic chunked convolutions is well-justified, addressing the challenges of context limitations in streaming scenarios. The innovative mode-consistency regularization (MCR-RNNT) is a significant methodological advancement, as it directly targets the performance gap between offline and streaming modes. The dual-mode training strategy is also a thoughtful approach to optimizing model performance across different operational contexts.
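A minimal sketch of the chunk-limited attention pattern with right context, expressed as a boolean attention mask; for simplicity the left context is unrestricted here, whereas the paper's implementation may also bound it and vary chunk sizes dynamically.

    import torch

    def chunk_limited_mask(seq_len, chunk_size, right_context=0):
        """Boolean attention mask (True = may attend): each frame attends to all
        frames up to the end of its own chunk plus an optional right context,
        and to every earlier frame."""
        idx = torch.arange(seq_len)
        chunk_end = (idx // chunk_size + 1) * chunk_size     # exclusive end of own chunk
        limit = torch.clamp(chunk_end + right_context, max=seq_len)
        # frame i may attend to frame j iff j < limit[i]
        return idx.unsqueeze(0) < limit.unsqueeze(1)

    print(chunk_limited_mask(seq_len=8, chunk_size=4, right_context=2).int())

Training with such masks makes streaming decoding a matter of feeding chunks as they arrive, which is why the same checkpoint can also run in full-context offline mode.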
The experiments are comprehensive, utilizing a large dataset of 120,000 hours of labeled English speech, which is crucial for validating the proposed methods. The evaluation across multiple test sets enhances the robustness of the results. The paper reports significant improvements in streaming accuracy while maintaining offline performance, which is a critical requirement for practical ASR systems. The ablation studies provide valuable insights into the effectiveness of the proposed MCR-RNNT loss and the impact of varying right context parameters.
The authors mention that the model and code will be open-sourced, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics that would aid in replicating the experiments, such as hyperparameter settings and training configurations.
While the proposed methods show promise, the paper does not thoroughly discuss potential limitations, such as the computational overhead introduced by dual-mode training and the scalability of the approach to other languages or dialects. Additionally, the performance in extremely low-latency scenarios could be further explored.
The advancements in unified ASR systems have significant implications for real-world applications, especially in environments requiring both high accuracy and low latency, such as virtual assistants and real-time transcription services. The open-sourcing of the model also encourages further research and development in the ASR community.
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cutoff frequency and the signal high-pass cutoff frequency through a single RC time constant, creating a trade-off between noise reduction and signal bandwidth. This paper proposes PDS-Amp (Photoelectric DC Servo Amplifier), a circuit technique that replaces the gate-bias resistor with a photoelectric element functioning as an ultra-high-impedance current source. A DC servo loop using lag-lead compensation feeds back the preamplifier output through an LED to control the photocurrent, thereby stabilizing the gate bias while decoupling the noise and signal cutoff frequencies. A custom photosensor based on the external photoelectric effect of a zinc photocathode was fabricated to achieve sub-picoampere dark current, overcoming the limitations of commercial semiconductor photodiodes. Combined with a cascode JFET preamplifier that minimizes input capacitance through bootstrap action, PDS-Amp achieved a self-noise of 11 dBA with a 12 pF dummy microphone. Despite using a small-diameter ECM capsule, this performance is comparable to that of large-diaphragm condenser microphones costing several thousand dollars. Recording experiments with an actual ECM capsule qualitatively confirmed a significant reduction in background noise. The proposed technique is applicable not only to microphones but broadly to capacitive sensors including accelerometers, pressure sensors, and pyroelectric sensors.
Primary: National Agriculture and Food Research Organization (NARO)
All Institutions: National Agriculture and Food Research Organization (NARO), University of Tsukuba
This paper presents the PDS-Amp, a novel circuit technique that effectively reduces self-noise in capacitive sensors, demonstrating significant improvements in performance and potential applications across various sensor technologies. The comprehensive methodology and experimental validation underscore its importance in advancing the field of audio and sensor technology.
The methodology presented in this paper is innovative as it introduces the PDS-Amp, which replaces the conventional gate-bias resistor with a photoelectric element to significantly reduce self-noise in capacitive sensors. The use of a DC servo loop to stabilize the gate bias while decoupling noise and signal cutoff frequencies is a novel approach that addresses the inherent trade-offs in traditional designs. The theoretical background is well-articulated, providing a solid foundation for the proposed method.
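The trade-off being broken is set by a single RC corner frequency, f_c = 1/(2*pi*R*C), which acts as both the signal high-pass and the noise low-pass corner in a conventional ECM front end. The sketch below uses the 12 pF dummy-microphone capacitance from the paper with purely illustrative resistor values.

    import math

    def rc_cutoff_hz(resistance_ohm, capacitance_farad):
        """First-order RC corner frequency f_c = 1 / (2*pi*R*C)."""
        return 1.0 / (2.0 * math.pi * resistance_ohm * capacitance_farad)

    c = 12e-12  # 12 pF capsule capacitance, as in the dummy-microphone test
    for r in (1e9, 10e9, 100e9):  # illustrative gate-bias resistor values
        print(f"R = {r:.0e} Ohm -> f_c = {rc_cutoff_hz(r, c):.2f} Hz")

Raising R pushes the noise corner down but simultaneously pushes the signal high-pass corner down by the same factor, which is exactly the coupling the photoelectric current source removes.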
The experiments conducted are thorough, including noise spectral density comparisons and self-noise evaluations using both dummy microphones and actual ECM capsules. The results demonstrate a significant reduction in self-noise, achieving 11 dBA, which is a substantial improvement over conventional methods. The qualitative recording experiments further validate the effectiveness of the proposed technique in real-world applications.
While the paper provides detailed descriptions of the circuit design and experimental setup, the lack of publicly available code or a project repository limits reproducibility. Future work should include sharing the circuit schematics and experimental data to enhance reproducibility.
The paper acknowledges potential limitations regarding the long-term stability of the custom photosensor, the increased complexity of the circuit due to the DC servo loop, and the need for close proximity between the photoelectric element and the LED. These factors could pose challenges in practical applications.
The PDS-Amp technique has significant implications for various capacitive sensors beyond microphones, including accelerometers and pressure sensors, potentially leading to advancements in sensor technology across multiple fields. The ability to achieve low self-noise without increasing size or voltage requirements could revolutionize the design of compact, high-performance sensors.
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while an AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, The Chinese University of Hong Kong, SenseTime Research
The paper presents the first autoregressive generative backbone tailored for streaming Target Speaker Extraction, filling a critical research void. The technical contributions, particularly the innovative Chunk-wise Interleaved Splicing Paradigm and historical context refinement mechanism, represent significant advancements in the field, with the potential to improve real-time audio processing applications substantially.
The paper presents a novel autoregressive model specifically designed for streaming Target Speaker Extraction (TSE), introducing the Chunk-wise Interleaved Splicing Paradigm. This approach effectively addresses the mismatch between training and streaming inference by ensuring causality and stability in real-time applications. The historical context refinement mechanism is a significant addition that enhances the coherence of extracted speech segments, mitigating boundary discontinuities. The methodology is well-structured, with clear definitions and a logical flow from problem identification to proposed solutions. The use of autoregressive models in a streaming context is innovative, and the interleaved splicing paradigm is a clever engineering solution that maintains efficiency.
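A toy illustration of the chunk-plus-history control flow and the RTF definition; the paper's Chunk-wise Interleaved Splicing Paradigm and historical context refinement operate on model tokens and are considerably more involved, and `extract_chunk` below is a placeholder for the actual AR model.

    def stream_extract(mixture, chunk_len, history_len, extract_chunk):
        """Toy chunk-wise streaming loop: each chunk is processed together with a
        short history window, and only the newly generated portion is emitted."""
        out, history = [], []
        for start in range(0, len(mixture), chunk_len):
            chunk = mixture[start:start + chunk_len]
            context = history[-history_len:] + chunk
            processed = extract_chunk(context)       # placeholder for the AR model
            out.extend(processed[-len(chunk):])      # splice only the new segment
            history.extend(chunk)
        return out

    def real_time_factor(processing_seconds, audio_seconds):
        """RTF < 1 means the system runs faster than real time."""
        return processing_seconds / audio_seconds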
The experimental results are robust, showcasing a comprehensive evaluation against both generative and discriminative baselines. The use of the Libri2Mix dataset is appropriate, and the metrics employed (DNSMOS, NISQA, WER, etc.) are relevant for assessing speech quality and intelligibility. The results demonstrate that the proposed method not only maintains stability at low latencies but also achieves comparable or superior performance to offline models. The ablation studies provide valuable insights into the effectiveness of the historical context refinement and input strategies, reinforcing the contributions of the proposed methodology.
The paper provides sufficient implementation details, including the architecture of the model, the training protocol, and the evaluation metrics. However, the lack of a public demo or project URL limits the reproducibility of the results. Future work could benefit from sharing code and models to facilitate further research and validation by the community.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world scenarios. Additionally, while the proposed method shows promise, the performance at extreme low latencies (e.g., below 80ms) is not thoroughly evaluated. There may also be concerns regarding the generalizability of the model to other languages or dialects, which could affect its applicability in diverse settings.
This work has significant implications for real-time applications such as teleconferencing, voice-controlled systems, and multi-turn dialogue interactions. By enabling high-quality speech extraction in latency-sensitive environments, the proposed method can enhance user experiences in various audio processing applications. The approach could also inspire further research into autoregressive models for other real-time audio tasks, potentially leading to broader advancements in the field of speech processing.
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the research community. However, existing studies predominantly focus on English or Chinese, leaving the vulnerability of Indic languages largely unexplored. To bridge this gap, we introduce the Indic-CodecFake (ICF) dataset, the first large-scale benchmark comprising real and NAC-synthesized speech across multiple Indic languages, diverse speaker profiles, and multiple NAC types. We use IndicSUPERB as the real speech corpus for generating the ICF dataset. Our experiments demonstrate that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to ICF, underscoring the challenges posed by phonetic diversity and prosodic variability in Indic speech. Further, we present a systematic evaluation of SOTA ALMs in a zero-shot setting on the ICF dataset. We evaluate these ALMs as they have shown effectiveness for different speech tasks. However, our findings reveal that current ALMs exhibit consistently poor performance. To address this, we propose SATYAM, a novel hyperbolic ALM tailored for CF detection in Indic languages. SATYAM integrates semantic representations from Whisper and prosodic representations from TRILLsson through the Bhattacharya distance in hyperbolic space, and subsequently performs the same alignment procedure between the fused speech representation and an input conditioning prompt. This dual-stage fusion framework enables SATYAM to effectively model hierarchical relationships both within speech (semantic-prosodic) and across modalities (speech-text). Extensive evaluations show that SATYAM consistently outperforms competitive end-to-end and ALM-based baselines on the ICF benchmark.
Primary: IIIT-Delhi, India
All Institutions: UPES, India, Veer Bahadur Singh Purvanchal University, India, IIIT-Delhi, India
The paper significantly advances the field of audio deepfake detection by introducing a dedicated dataset for Indic languages and a novel hyperbolic ALM tailored for this context. The comprehensive methodology and robust experimental results underscore its potential impact on both academic research and practical applications in speech technology.
The paper introduces a novel hyperbolic ALM, SATYAM, which integrates semantic and prosodic representations for detecting speech deepfakes in Indic languages. The methodology is well-structured, leveraging Bhattacharya distance in hyperbolic space for alignment, which is a unique approach in the context of audio deepfake detection. The dual-stage fusion framework is innovative, allowing for effective modeling of hierarchical relationships within speech and across modalities. The use of existing models like Whisper and TRILLsson as encoders is appropriate, and the choice of hyperbolic geometry adds a compelling dimension to the representation of speech data.
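For reference, the standard Euclidean Gaussian form of the Bhattacharya distance mentioned above is given below; SATYAM applies the distance in hyperbolic space, which this sketch does not reproduce.

    import numpy as np

    def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
        """Bhattacharyya distance between two multivariate Gaussians:
        D_B = 1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 det S2)),
        with S = (S1 + S2) / 2."""
        cov = 0.5 * (cov1 + cov2)
        diff = mu1 - mu2
        term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        _, logdet1 = np.linalg.slogdet(cov1)
        _, logdet2 = np.linalg.slogdet(cov2)
        term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
        return term1 + term2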
The experiments are robust, featuring a comprehensive evaluation of the proposed ICF dataset and comparisons against state-of-the-art models. The zero-shot evaluation of SOTA ALMs demonstrates the challenges in generalizing to Indic languages, providing a strong rationale for the development of SATYAM. The results show significant improvements over existing baselines, validating the effectiveness of the proposed framework. The paper also includes thorough ablation studies that highlight the contributions of various components, enhancing the credibility of the findings.
The authors provide a clear description of the dataset generation process, model architecture, and training procedures, which facilitates reproducibility. The inclusion of a project URL with code and dataset access further supports this aspect. However, the paper could benefit from more detailed hyperparameter settings and training configurations to enhance clarity.
One notable limitation is the reliance on a single family of LLM decoders, which may restrict the generalizability of the findings. Additionally, while the proposed framework shows promise, the performance on noisy conditions could be further explored to assess robustness in real-world applications. The paper acknowledges these limitations and suggests future work to explore alternative encoder and decoder architectures.
The introduction of the ICF dataset and the SATYAM model has significant implications for speech technology in multilingual contexts, particularly in combating the rising threat of speech deepfakes. By focusing on Indic languages, the work addresses a critical gap in the literature and provides a foundation for future research in low-resource language settings. The ethical considerations outlined also reflect a responsible approach to the potential misuse of the technology.
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.
Primary: Canva Research
All Institutions: Canva Research, Dolby Laboratories
The main contribution of this paper is the introduction of the Chain-of-Details framework, which innovatively extends the coarse-to-fine generation paradigm to incorporate temporal dynamics in TTS synthesis, leading to more natural speech generation with improved efficiency. This work represents a meaningful advancement in the field of audio synthesis, combining theoretical insights with practical applications that could influence future developments in TTS technologies.
The proposed Chain-of-Details (CoD) framework introduces a novel approach to modeling temporal dynamics in Text-To-Speech (TTS) synthesis through a multi-stage, cascaded architecture. This method effectively refines speech generation across various temporal resolutions, which is a significant advancement over existing coarse-to-fine generation paradigms. The use of a shared decoder across different temporal levels enhances parameter efficiency and consistency. The methodology is well-grounded in previous work, yet it innovatively extends the temporal modeling aspect, which has been largely overlooked in prior TTS systems.
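A toy sketch of the cascaded, shared-decoder refinement loop described above; the decoder and upsampling callables are placeholders, and the conditioning scheme is an assumption rather than the paper's exact interface.

    def chain_of_details(text_tokens, level_factors, shared_decoder, upsample):
        """Toy cascaded refinement: each stage predicts tokens at a finer temporal
        granularity, conditioned on the upsampled output of the previous stage.
        `shared_decoder(level, conditioning)` stands in for one decoder reused
        across all levels; `upsample(tokens, factor)` repeats or interpolates tokens."""
        coarse = shared_decoder(level=0, conditioning=text_tokens)
        for level, factor in enumerate(level_factors, start=1):
            conditioning = text_tokens + upsample(coarse, factor)
            coarse = shared_decoder(level=level, conditioning=conditioning)
        return coarse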
The experimental evaluation is robust, utilizing multiple datasets, including LibriSpeech and SeedTTS, to validate the effectiveness of the CoD framework. The results demonstrate competitive performance in terms of Word Error Rate (WER) with fewer parameters compared to existing models, indicating a significant improvement in efficiency. The inclusion of ablation studies further strengthens the findings by providing insights into the effects of different temporal levels and token types.
The paper provides detailed implementation specifics, including model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of publicly available code or demo URLs may hinder broader accessibility for researchers looking to replicate or build upon this work.
While the CoD framework shows promise, the paper does not address potential limitations related to the scalability of the model to more complex speech patterns or the generalization to diverse languages and accents. Additionally, the reliance on specific datasets may limit the applicability of the findings to other contexts.
The implications of this research are significant, as improved TTS systems can enhance accessibility for individuals with speech impairments, improve user experiences in virtual assistants, and contribute to advancements in human-computer interaction. The explicit modeling of temporal dynamics could also pave the way for more nuanced applications in multimedia content creation and entertainment.