Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech data, standardized evaluation benchmarks, and publicly available models. In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. Building upon this dataset, we introduce WenetSpeech-Wu-Bench, the first standardized and publicly accessible benchmark for systematic evaluation of Wu dialect speech processing, covering automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS (instruct TTS). Furthermore, we release a suite of strong open-source models trained on WenetSpeech-Wu, establishing competitive performance across multiple tasks and empirically validating the effectiveness of the proposed dataset. Together, these contributions lay the foundation for a comprehensive Wu dialect speech processing ecosystem, and we open-source proposed datasets, benchmarks, and models to support future research on dialectal speech intelligence.
Primary: ASLP Lab
All Institutions: ASLP Lab
The paper presents WenetSpeech-Wu, a pioneering effort to create a comprehensive speech processing ecosystem for the Wu dialect, addressing critical gaps in data, benchmarks, and models. The technical contributions are substantial, with a well-defined methodology and rigorous experimental validation, making it a significant advancement in the field of speech processing for low-resource dialects.
The methodology employed in this paper is robust, focusing on the creation of a large-scale, multi-dimensionally annotated speech corpus for the Wu dialect. The authors detail a comprehensive data collection and annotation pipeline that integrates various automated and semi-automated techniques to ensure high-quality outputs. The use of advanced models for automatic transcription and emotion recognition, along with a structured quality grading system, reflects a thorough approach to addressing the challenges of low-resource dialect speech processing.
The experiments conducted are extensive and well-structured, covering a variety of speech processing tasks including ASR, TTS, and emotion recognition. The introduction of the WenetSpeech-Wu-Bench as a standardized benchmark is particularly noteworthy, as it facilitates fair evaluation and comparison across different models. The performance metrics reported demonstrate significant improvements over existing systems, validating the effectiveness of the proposed models and dataset.
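For readers who want to see what this kind of benchmark scoring looks like in practice, the snippet below sketches ASR evaluation with character error rate, the usual metric for Chinese-language speech; the toy transcripts and the use of the jiwer package are assumptions for illustration, not details taken from the paper.

```python
# Minimal ASR scoring sketch (assumed setup): compare hypothesis transcripts
# against references with character error rate, the standard metric for Chinese ASR.
import jiwer

refs = ["今天天气很好", "我们去吃饭吧"]   # reference transcripts (toy examples)
hyps = ["今天天气真好", "我们去吃饭吧"]   # model outputs (toy examples)

# jiwer.cer counts character-level substitutions, insertions, and deletions.
cer = jiwer.cer(refs, hyps)
print(f"CER: {cer:.2%}")
```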
The paper provides sufficient details regarding the data collection, annotation processes, and model training procedures, which enhances reproducibility. The availability of the dataset and models on GitHub further supports the research community in replicating and building upon this work. However, the reliance on automated annotations may introduce some variability that could affect reproducibility in practical applications.
While the dataset is extensive, the authors acknowledge that the distribution across dialects and domains is not perfectly balanced, which could impact model generalization. Additionally, the automated annotation processes, despite quality controls, may still contain noise, potentially affecting the reliability of certain tasks.
This work has significant implications for the development of inclusive speech technologies, particularly for underrepresented dialects. By establishing a comprehensive ecosystem for Wu dialect speech processing, the authors pave the way for future research and applications in low-resource language settings, potentially enhancing accessibility and usability in various domains.
We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, HeartMuLa provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation, which is suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.
Primary: Peking University
All Institutions: Peking University, Scale Global, Ario, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of a family of open-source music foundation models that unify various aspects of music understanding and generation, providing a robust framework for future research and practical applications in the creative industry. The integration of advanced methodologies and the potential for user-controlled generation mark a significant advancement in the field of machine learning for audio.
The paper introduces a comprehensive framework for music generation and understanding, comprising four distinct models that address various aspects of music processing. HeartCLAP focuses on audio-text alignment, HeartTranscriptor enhances lyric recognition, HeartCodec serves as a high-fidelity music tokenizer, and HeartMuLa is an LLM-based model for music generation. The methodology is well-structured, leveraging existing advancements in audio processing while innovating through the integration of low-frame-rate tokenization and user-controllable generation, which is a significant step forward in the field.
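To make the benefit of a 12.5 Hz tokenizer concrete, the short calculation below compares autoregressive sequence lengths at different codec frame rates; the comparison rates and the single-codebook assumption are illustrative and not figures reported in the paper.

```python
# Sequence-length arithmetic for autoregressive modeling over codec tokens.
# Assumes one token per frame per codebook; multi-codebook schemes multiply these counts.
def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    return int(duration_s * frame_rate_hz)

song_s = 180  # a three-minute song
for rate in (50.0, 25.0, 12.5):
    print(f"{rate:5.1f} Hz -> {tokens_for(song_s, rate):5d} tokens for {song_s}s of audio")
# At 12.5 Hz a full song stays within a few thousand tokens, which is what makes
# long-range structure tractable for an LLM-style autoregressive generator.
```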
The authors claim extensive experiments demonstrating improvements over existing baselines in terms of reconstruction quality and generation performance. However, specific details regarding datasets, experimental setups, and quantitative results are not provided in the abstract or methodology sections, which limits the ability to fully assess the robustness of their claims. The scaling to 7B parameters is a notable highlight, suggesting that generation quality benefits substantially from increased model capacity.
The paper mentions that the models are open-sourced, which is a positive aspect for reproducibility. However, without a dedicated URL for the code or datasets, it is challenging to evaluate the ease of reproducing the results. Clear documentation and access to the models would be necessary to facilitate further research.
The paper does not address potential limitations or challenges encountered during the model development or the implications of using large-scale models in music generation. Additionally, while the ethical considerations are mentioned, the practical challenges of implementing watermarking and ensuring accountability in AI-generated music are not thoroughly discussed.
The HeartMuLa framework has the potential to significantly impact the music industry by providing tools for content creators to generate music efficiently and creatively. The open-source nature of the project could foster collaboration and innovation in the field of AI music generation, although ethical implications and copyright issues remain critical considerations.
Despite rapid progress in text-to-speech (TTS), open-source systems still lack truly instruction-following, fine-grained control over core speech attributes (e.g., pitch, speaking rate, age, emotion, and style). We present VoiceSculptor, an open-source unified system that bridges this gap by integrating instruction-based voice design and high-fidelity voice cloning in a single framework. It generates controllable speaker timbre directly from natural-language descriptions, supports iterative refinement via Retrieval-Augmented Generation (RAG), and provides attribute-level edits across multiple dimensions. The designed voice is then rendered into a prompt waveform and fed into a cloning model to enable high-fidelity timbre transfer for downstream speech synthesis. VoiceSculptor achieves open-source state-of-the-art (SOTA) on InstructTTSEval-Zh, and is fully open-sourced, including code and pretrained models, to advance reproducible instruction-controlled TTS research.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Yutu Zhineng, Shanghai Lingguang Zhaxian Technology, WeNet Open Source Community
VoiceSculptor represents a significant advancement in the field of text-to-speech synthesis by integrating user-driven voice design with high-fidelity cloning capabilities. The innovative approach to controllable speech attributes and the commitment to open-source principles position this work as a valuable contribution to the ongoing evolution of TTS technologies.
The methodology presented in VoiceSculptor is innovative as it combines instruction-based voice design with high-fidelity voice cloning. The use of natural language descriptions to generate speaker timbre is a significant advancement, enabling users to have fine-grained control over various speech attributes. The incorporation of Retrieval-Augmented Generation (RAG) for iterative refinement is a notable methodological enhancement that allows for dynamic adjustments based on user feedback. However, the paper could benefit from a more detailed explanation of the underlying algorithms and their integration.
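The RAG-based refinement loop is only described at a high level, so the sketch below illustrates one plausible retrieval step: matching a user's natural-language voice description against a small library of stored exemplars by embedding similarity. The `embed` placeholder and the exemplar descriptions are hypothetical and not part of the VoiceSculptor codebase.

```python
# Hypothetical retrieval step for instruction-based voice design:
# rank stored voice exemplars by cosine similarity to the user's description.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder text encoder; a real system would use a sentence embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

descriptions = [
    "low-pitched, calm, middle-aged male narrator",
    "bright, fast, youthful female voice",
]
library = {d: embed(d) for d in descriptions}

query = embed("a warm, slow, deep male storytelling voice")
best = max(library, key=lambda d: float(query @ library[d]))
print("retrieved exemplar:", best)
```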
The authors claim to achieve state-of-the-art results on the InstructTTSEval-Zh benchmark, which indicates a strong performance in the field. However, the evaluation metrics and comparison with existing systems are not sufficiently detailed in the abstract. A comprehensive analysis of the experiments, including dataset descriptions, training procedures, and qualitative assessments of the generated voices, would strengthen the paper's credibility. The results should also include user studies to validate the effectiveness of the controllability features.
The authors mention that the system is fully open-sourced, which is a positive aspect for reproducibility. However, specific details regarding the setup, dependencies, and instructions for running the code are not provided in the abstract. A clear and thorough documentation in the supplementary materials or appendix would enhance reproducibility.
The paper acknowledges potential limitations but does not elaborate on them in the abstract. Some possible limitations include the scalability of the system to different languages, the quality of generated voices in diverse emotional states, and the computational resources required for high-fidelity cloning. Addressing these limitations in the full paper would provide a more balanced view of the system's capabilities.
VoiceSculptor has the potential to significantly impact various applications, including personalized voice assistants, content creation, and accessibility tools for individuals with speech impairments. By enabling users to design their voices through simple instructions, the system democratizes voice synthesis technology and could lead to broader adoption in both consumer and professional settings.
Vision Language Action (VLA) models promise an open-vocabulary interface that can translate perceptual ambiguity into semantically grounded driving decisions, yet they still treat language as a static prior fixed at inference time. As a result, the model must infer continuously shifting objectives from pixels alone, yielding delayed or overly conservative maneuvers. We argue that effective VLAs for autonomous driving need an online channel in which users can influence driving with specific intentions. To this end, we present EchoVLA, a user-aware VLA that couples camera streams with in situ audio instructions. We augment the nuScenes dataset with temporally aligned, intent-specific speech commands generated by converting ego-motion descriptions into synthetic audio. Further, we compose emotional speech-trajectory pairs into a multimodal Chain-of-Thought (CoT) for fine-tuning a Multimodal Large Model (MLM) based on Qwen2.5-Omni. Specifically, we synthesize the audio-augmented dataset with different emotion types paired with corresponding driving behaviors, leveraging the emotional cues embedded in tone, pitch, and speech tempo to reflect varying user states such as urgent or hesitant intentions. This enables EchoVLA to interpret not only the semantic content but also the emotional context of audio commands, yielding more nuanced and emotionally adaptive driving behavior. In open-loop benchmarks, our approach reduces the average L2 error by $59.4\%$ and the collision rate by $74.4\%$ compared to a vision-only perception baseline. Further experiments on the nuScenes dataset validate that EchoVLA not only steers the trajectory through audio instructions but also modulates driving behavior in response to the emotions detected in the user's speech.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of EchoVLA, a user-aware VLA model that couples audio instructions with visual inputs to improve autonomous driving behavior. This work represents a significant step towards creating more interactive and emotionally intelligent autonomous systems, though it requires further development in reproducibility and real-world applicability.
The proposed methodology introduces EchoVLA, which integrates audio instructions with visual inputs to enhance user-aware driving behavior. The use of emotional speech-trajectory pairs and the synthesis of an audio-augmented dataset are innovative aspects that address the limitations of traditional VLA models. However, the methodology could benefit from a more detailed explanation of the data generation process and the specific architecture of the Multimodal Large Model (MLM) used.
The experiments conducted on the nuScenes dataset demonstrate significant improvements in driving performance metrics, such as a 59.4% reduction in average L2 error and a 74.4% decrease in collision rates compared to a vision-only baseline. While these results are promising, additional comparisons with other state-of-the-art methods could strengthen the findings. The benchmarks should also include a qualitative assessment of driving behavior to complement the quantitative metrics.
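For context on how the reported numbers are typically obtained, the snippet below computes an average L2 trajectory error against ground-truth waypoints and the relative reduction versus a baseline, in the style of open-loop nuScenes evaluation; the toy trajectories are illustrative only.

```python
# Open-loop trajectory metric sketch: mean L2 distance between predicted and
# ground-truth waypoints, plus the relative reduction versus a baseline.
import numpy as np

gt       = np.array([[1.0, 0.0], [2.0, 0.1], [3.0, 0.3]])   # ground-truth waypoints (x, y)
pred     = np.array([[1.1, 0.0], [2.1, 0.2], [3.1, 0.4]])   # model prediction
baseline = np.array([[1.4, 0.2], [2.5, 0.5], [3.6, 0.9]])   # vision-only baseline prediction

def avg_l2(a, b):
    return float(np.linalg.norm(a - b, axis=1).mean())

err_model, err_base = avg_l2(pred, gt), avg_l2(baseline, gt)
print(f"relative L2 reduction: {(err_base - err_model) / err_base:.1%}")
```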
The paper lacks detailed implementation specifics and code availability, which are crucial for reproducibility. Without access to the dataset augmentation process and the model training details, it would be challenging for other researchers to replicate the results or build upon this work.
One limitation is the reliance on synthetic audio commands, which may not fully capture the variability and complexity of real-world user interactions. Additionally, the emotional context interpretation may be limited by the quality and diversity of the training data. The paper does not address potential biases in the emotional speech data or how they might affect driving behavior.
The integration of emotional intelligence into autonomous driving systems has the potential to enhance user experience and safety. By allowing vehicles to respond to user intentions and emotional states, this research could pave the way for more adaptive and user-friendly autonomous systems. However, ethical considerations regarding user privacy and the implications of emotional manipulation in driving contexts must be addressed.
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose \textbf{ES4R}, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are then integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we employ energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.
Primary: unknown
All Institutions: unknown
The paper presents ES4R, a novel framework for empathetic response generation that effectively models structured affective context from speech inputs, significantly advancing the field of empathetic dialogue systems.
The proposed ES4R framework introduces a dual-level attention mechanism for affective context modeling, which is a significant advancement over existing methods that rely on implicit learning or explicit emotion supervision. By explicitly modeling affective states and dynamics before encoding, the methodology addresses key limitations in current empathetic response generation systems. The integration of speech-guided cross-modal attention further enhances the model's ability to generate contextually relevant and emotionally resonant responses.
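The exact dual-level attention design is not reproduced in this summary; the sketch below shows one common way such a mechanism can be structured, with self-attention within each turn followed by attention across turn summaries. All module names, pooling choices, and dimensions are assumptions rather than the authors' implementation.

```python
# Illustrative dual-level attention: turn-level pooling, then dialogue-level attention.
import torch
import torch.nn as nn

class DualLevelAffect(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.turn_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, turns):  # turns: list of (frames, dim) tensors, one per dialogue turn
        summaries = []
        for t in turns:
            t = t.unsqueeze(0)                      # (1, frames, dim)
            h, _ = self.turn_attn(t, t, t)          # turn-level affective state
            summaries.append(h.mean(dim=1))         # pool frames -> (1, dim)
        dial = torch.stack(summaries, dim=1)        # (1, num_turns, dim)
        ctx, _ = self.dial_attn(dial, dial, dial)   # dialogue-level affective dynamics
        return ctx                                  # contextualized per-turn affect

feats = [torch.randn(40, 256), torch.randn(55, 256)]  # two turns of frame features
print(DualLevelAffect()(feats).shape)                 # torch.Size([1, 2, 256])
```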
The experiments conducted on the AvaMERG dataset demonstrate the effectiveness of ES4R, with consistent improvements over strong baselines in both automatic and human evaluations. The inclusion of ablation studies provides insights into the contributions of different components, validating the importance of the dual-level attention and cross-modal fusion mechanisms. The results indicate robust performance across various LLM backbones, showcasing the framework's versatility.
The paper provides detailed descriptions of the architecture, training strategies, and evaluation metrics used, which supports reproducibility. However, the absence of a publicly available implementation or code repository limits the ability for others to replicate the results fully.
While the framework shows promise, it relies on a relatively simple set of empathetic speaking styles and does not incorporate additional paralinguistic cues, which could enhance the naturalness of the generated speech. Future work is needed to explore more nuanced emotional expressions and the incorporation of speaker-aware personalization.
The development of empathetic dialogue systems has significant implications for applications in mental health support, customer service, and social robotics, where understanding and responding to human emotions is crucial. The ES4R framework could enhance user experience in these domains by providing more emotionally aware interactions.
Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B .
Primary: unknown
All Institutions: unknown
Chroma 1.0 represents a significant advancement in real-time spoken dialogue systems, combining low-latency interaction with high-fidelity personalized voice cloning. The innovative architecture and training strategy contribute to its effectiveness, although further clarity in methodology and experimental details would strengthen its impact in the field.
The methodology presented in Chroma 1.0 is innovative, particularly in its use of an interleaved text-audio token schedule which allows for real-time processing. The model architecture, which includes a Backbone and a Decoder, is well-structured to ensure causal alignment between text and audio. The two-stage training strategy is a thoughtful approach to stabilize optimization and enhance the quality of voice cloning. However, the paper could benefit from a clearer explanation of the interdependencies between the Backbone and Decoder components.
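A 1:2 interleaved schedule means the generated stream alternates one text token with two audio tokens, so audio can be emitted while the textual response is still being produced. The sketch below is a schematic of that ordering under the stated assumption, not the released implementation.

```python
# Schematic 1:2 text-audio interleaving: one text token followed by two audio tokens,
# with leftover audio tokens appended once the text stream is exhausted.
def interleave(text_tokens, audio_tokens, ratio=2):
    out, a = [], 0
    for t in text_tokens:
        out.append(("text", t))
        out.extend(("audio", audio_tokens[i]) for i in range(a, min(a + ratio, len(audio_tokens))))
        a += ratio
    out.extend(("audio", tok) for tok in audio_tokens[a:])
    return out

print(interleave(["he", "llo"], [101, 102, 103, 104, 105, 106]))
# [('text', 'he'), ('audio', 101), ('audio', 102),
#  ('text', 'llo'), ('audio', 103), ('audio', 104), ('audio', 105), ('audio', 106)]
```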
The experimental results are promising, showing a 10.96% relative improvement in speaker similarity over a human baseline, which is significant for applications in personalized voice interaction. The reported Real-Time Factor (RTF) of 0.43 indicates that the model is capable of real-time performance, which is crucial for dialogue systems. However, the paper lacks detailed descriptions of the datasets used for training and evaluation, which would help in assessing the robustness of the results.
The authors have made their code and models publicly available, which is a positive aspect for reproducibility. However, the paper does not provide sufficient details about the training process, hyperparameters, or the specific datasets used, which could hinder other researchers from replicating the results effectively.
The paper acknowledges limitations, such as potential challenges in maintaining speaker identity across diverse accents or emotional tones. Additionally, the reliance on a specific architecture may limit generalizability to other dialogue contexts or languages. More discussion on these limitations would enhance the transparency of the research.
The potential applications of Chroma 1.0 are vast, particularly in areas such as virtual assistants, gaming, and accessibility technologies. The ability to maintain speaker identity while providing real-time dialogue capabilities could significantly enhance user experience in personalized interactions. However, ethical considerations regarding voice cloning and the implications of misuse should be addressed more thoroughly.
Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio such as pitch, loudness, and spatial location remains under-explored. To bridge this gap, we introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five perceptual dimensions. Unlike previous datasets, SonicBench uses a controllable generation toolbox to construct stimuli for two complementary paradigms: recognition (absolute judgment) and comparison (relative judgment). This design allows us to probe not only sensory precision but also relational reasoning capabilities, a domain where humans typically exhibit greater proficiency. Our evaluation reveals a substantial deficiency in LALMs' foundational auditory understanding; most models perform near random guessing and, contrary to human patterns, fail to show the expected advantage on comparison tasks. Furthermore, explicit reasoning yields minimal gains. Crucially, however, our linear probing analysis demonstrates that frozen audio encoders do capture these physical cues (probe accuracy of at least 60%), suggesting that the primary bottleneck lies in the alignment and decoding stages, where models fail to leverage the sensory signals they have already captured.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of SonicBench, a benchmark that systematically evaluates the physical perception capabilities of LALMs, revealing critical deficiencies in their auditory understanding and suggesting pathways for future research. The study's innovative methodology and rigorous experimental design provide valuable insights into the limitations of current models and highlight the importance of foundational auditory understanding in advancing the field.
The paper introduces SonicBench, a benchmark designed to evaluate the physical perception capabilities of Large Audio Language Models (LALMs). The methodology is innovative as it combines psychophysics with machine learning, allowing for a nuanced understanding of LALMs' auditory processing. The use of a controllable generation toolbox to create stimuli for both recognition and comparison tasks is particularly noteworthy, as it enables a comprehensive assessment of sensory precision and relational reasoning. However, the reliance on linear classifiers for probing may limit the depth of insights into model capabilities.
The experiments are well-structured, focusing on 12 core physical attributes across five perceptual dimensions. The results indicate a significant gap in LALMs' performance compared to human benchmarks, highlighting a critical area for future research. The finding that frozen audio encoders can capture physical cues with at least 60% accuracy is promising, suggesting that the issue lies in the alignment and decoding stages rather than the sensory input itself. However, the evaluation could be strengthened by including more diverse datasets and exploring multilingual contexts.
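The linear-probing result is worth grounding: it amounts to training a simple classifier on frozen encoder embeddings and checking whether a physical attribute is linearly decodable. A generic version of that procedure is sketched below, with synthetic embeddings standing in for real encoder outputs.

```python
# Generic linear-probe sketch: if a frozen encoder's embeddings are linearly
# separable by attribute label, the information is present even when the LALM
# fails to exploit it downstream.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.standard_normal((600, 128))          # stand-in for frozen encoder embeddings
labels = (emb[:, 0] + 0.5 * rng.standard_normal(600) > 0).astype(int)  # e.g. low vs. high pitch

X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```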
The paper lacks detailed implementation specifics, which could hinder reproducibility. While the methodology is described, the absence of code or a project URL limits the ability of other researchers to replicate the findings. Providing access to the dataset and the generation toolbox would enhance reproducibility and facilitate further research in this area.
The paper acknowledges several limitations, including the focus on English text instructions, which may not reflect the performance of LALMs in multilingual settings. Additionally, the controlled nature of the dataset may restrict acoustic variability, potentially affecting the generalizability of the results. The authors also note that while their probing methodology is effective, it may not capture the full complexity of LALMs' auditory processing capabilities.
The findings of this study have significant implications for the development of more robust audio processing models. By identifying the physical perception bottleneck in LALMs, the research paves the way for future advancements in audio language models that can better understand and interpret the fundamental attributes of sound. This could lead to improvements in various applications, including speech recognition, music analysis, and interactive audio systems.
While Audio Large Language Models (ALLMs) have achieved remarkable proficiency, their robustness remains brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered acoustic dynamics -- or ``Acoustic Ecology'' -- that characterize authentic physical environments. To bridge this ecological gap, we introduce \textbf{RSA-Bench}, a comprehensive robustness benchmark designed to stress-test ALLMs through high-fidelity auditory scene simulations. Unlike traditional methods, we construct evaluation samples by naturally superimposing diverse environmental soundscapes -- spanning \textit{Pasture}, \textit{Extreme Weather}, \textit{Classroom}, and \textit{Outdoors} -- onto clean speech signals across a spectrum of interference intensities. By evaluating models on six core tasks ranging from fundamental perception to complex reasoning, our study unveils three macro-level insights: \textbf{(I) The Perception-Cognition Gap:} Models maintain relative resilience in low-level recognition but suffer a \textbf{functional collapse} in high-order reasoning tasks under stress; \textbf{(II) Scenario Sensitivity:} ``Vocal-like'' interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the model's auditory attention mechanisms; and \textbf{(III) The Denoising Paradox:} Standard speech enhancement often exacerbates performance degradation, as ALLMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.
Primary: Shanghai University
All Institutions: Beijing University of Posts and Telecommunications, Nanyang Technological University, Xidian University, Shanghai University
The paper introduces RSA-Bench, a novel benchmark for evaluating the robustness of Audio Large Models in complex acoustic environments, revealing critical insights into model vulnerabilities and the limitations of current denoising strategies. The comprehensive methodology and thorough experimental evaluation contribute significantly to the understanding of ALLMs' performance in real-world scenarios, highlighting the need for further research into improving model robustness.
The methodology presented in RSA-Bench is robust and innovative, focusing on creating a comprehensive benchmark for Audio Large Models (ALLMs) that simulates real-world acoustic environments. The authors utilize a multi-source superposition strategy to create realistic soundscapes, which is a significant improvement over traditional benchmarks that rely on simplistic noise models. The systematic approach to categorizing tasks into perception and cognitive reasoning is well-structured, allowing for a nuanced evaluation of model performance across different scenarios. However, the methodology could benefit from a clearer explanation of the acoustic ecology concept and its implications for model evaluation.
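The superposition strategy ultimately reduces to scaling environmental recordings against clean speech at controlled interference intensities. The helper below shows the standard way to mix a noise waveform into speech at a target signal-to-noise ratio; the specific levels and soundscapes used by RSA-Bench are not reproduced here.

```python
# Standard SNR-controlled mixing: scale the noise so the speech-to-noise power
# ratio matches the requested dB level, then superimpose.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)               # loop/trim noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

sr = 16000
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)    # 1 s toy "speech"
noise = np.random.default_rng(0).standard_normal(sr)     # toy soundscape
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```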
The experimental evaluation is thorough, with a large dataset of over 100,000 samples and a diverse set of tasks that comprehensively assess model performance. The results reveal critical insights into the vulnerabilities of ALLMs, particularly the perception-cognition gap and the denoising paradox. The use of various models for comparison adds depth to the analysis, although the paper could provide more detailed statistical analyses to support the findings. The experiments effectively highlight the limitations of current models in real-world scenarios, making a significant contribution to the field.
The authors provide a GitHub repository with code and datasets, which enhances reproducibility. However, the paper lacks detailed implementation specifics, such as the exact configurations used for the models and the parameters for the noise generation process. Including these details would improve the ability of other researchers to replicate the study.
One limitation is the focus on only a limited number of environmental scenarios, which may not encompass the full range of real-world acoustic conditions. Additionally, the paper primarily addresses inference-time mitigation strategies, leaving out potential training-time interventions that could enhance model robustness. The findings regarding the ineffectiveness of denoising methods could also benefit from further exploration of alternative strategies.
The findings of this research have significant implications for the deployment of ALLMs in real-world applications, particularly in environments with complex acoustic dynamics. By identifying vulnerabilities in model performance, this work paves the way for future research aimed at improving the robustness of audio processing systems, which could enhance their usability in various fields such as telecommunications, assistive technologies, and interactive AI systems.
Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
Primary: unknown
All Institutions: unknown
The paper presents GPA, a unified autoregressive framework for speech tasks that consolidates TTS, ASR, and VC into a single architecture, demonstrating competitive performance and efficiency. The innovative methodology and comprehensive evaluation contribute meaningfully to the advancement of speech processing technologies, with potential for significant real-world applications.
The paper introduces the General-Purpose Audio (GPA) model, which unifies TTS, ASR, and VC tasks within a single autoregressive framework. The methodology is innovative in its use of a shared discrete audio token space and instruction-driven task induction, allowing seamless switching between tasks without architectural modifications. The dual-tokenizer scheme enhances the model's ability to represent both semantic and acoustic information effectively. The joint multi-task training approach is well-justified, leveraging shared representations to improve performance across tasks. However, the reliance on a single autoregressive architecture may limit the model's ability to achieve peak performance in highly specialized tasks.
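Instruction-driven task induction over a shared token space typically amounts to prefixing the same autoregressive decoder with different task prompts. The sketch below illustrates that idea with hypothetical special tokens; the actual prompt format and vocabulary of GPA are not specified in this summary.

```python
# Hypothetical prompt construction for a unified autoregressive speech model:
# the same decoder consumes a task instruction plus text and/or audio tokens,
# then generates the target modality after a generation marker.
def build_sequence(task: str, text: str = "", audio_tokens=None):
    audio_tokens = audio_tokens or []
    seq = [f"<task:{task}>"]
    if text:
        seq += ["<text>"] + list(text) + ["</text>"]    # characters stand in for text tokens
    if audio_tokens:
        seq += ["<audio>"] + [f"<a{t}>" for t in audio_tokens] + ["</audio>"]
    return seq + ["<gen>"]

print(build_sequence("tts", text="hello"))          # target: audio tokens follow <gen>
print(build_sequence("asr", audio_tokens=[17, 3]))  # target: text tokens follow <gen>
```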
The empirical evaluation is comprehensive, covering training dynamics, performance metrics, and scalability across different model sizes. The authors provide quantitative results on benchmarks for TTS and ASR, demonstrating competitive performance relative to existing models. The evaluation of latency and throughput under streaming conditions is particularly relevant for real-world applications, showcasing the model's efficiency. However, the lack of separate evaluation metrics for voice conversion may limit the understanding of its performance in that domain.
The authors emphasize reproducibility by providing complete training code and configuration files on a public GitHub repository. This transparency is commendable and facilitates further research and validation of their results. The detailed description of the training protocol, data composition, and sample construction strategies enhances the reproducibility of the study.
The paper acknowledges several limitations, including the potential bottleneck introduced by the unified architecture, which may hinder performance on specialized tasks. Additionally, the inference cost and latency scaling with sequence length pose challenges for certain applications. The ASR performance of the lightweight model variant is noted to be comparatively weaker, indicating that further optimization and scaling could yield improvements.
The GPA model has significant implications for the field of speech processing, as it addresses the fragmentation of existing systems by providing a unified framework for multiple speech tasks. This could lead to more efficient deployment in real-time applications, particularly in resource-constrained environments. The model's ability to generalize across tasks may also foster advancements in cross-task learning and knowledge transfer in speech technologies.
This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new unsupervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines.
Primary: Université de Lorraine
All Institutions: Université de Lorraine, CNRS, Inria, Loria, Université Grenoble Alpes
This paper contributes a significant advancement in unsupervised speech enhancement by introducing a diffusion-based framework that explicitly models noise, leading to improved performance metrics and robustness in speech processing tasks. The methodology is innovative, and the experimental validation is thorough, indicating strong potential for real-world applications.
The paper introduces a novel unsupervised speech enhancement framework that leverages diffusion models to explicitly model both speech and noise as latent variables, improving upon previous methods that treated noise implicitly. The use of a joint score model for speech and noise is innovative and allows for enhanced performance in speech enhancement tasks. The iterative expectation-maximization (EM) scheme is well-structured, and the introduction of two variants (one implicit and one explicit for noise modeling) demonstrates a thoughtful approach to tackling the problem.
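To make the EM alternation concrete, the sketch below pairs a crude spectral-subtraction stand-in for the diffusion posterior-sampling E-step with standard multiplicative NMF updates for the noise model in the M-step. It is a structural illustration under those stated substitutions, not the authors' algorithm.

```python
# Skeleton of the EM alternation behind NMF-guided speech enhancement.
import numpy as np

def spectral_estep(noisy_pow, noise_pow):
    """Crude spectral-subtraction stand-in for the diffusion posterior-sampling E-step."""
    return np.maximum(noisy_pow - noise_pow, 1e-10)

def nmf_mstep(noise_pow, W, H, eps=1e-10):
    """Multiplicative updates (Euclidean) fitting W @ H to the residual noise power."""
    WH = W @ H + eps
    W *= (noise_pow @ H.T) / (WH @ H.T + eps)
    WH = W @ H + eps
    H *= (W.T @ noise_pow) / (W.T @ WH + eps)
    return W, H

def em_enhance(noisy_pow, n_iters=10, rank=8, seed=0):
    F, T = noisy_pow.shape
    rng = np.random.default_rng(seed)
    W, H = rng.random((F, rank)) + 0.1, rng.random((rank, T)) + 0.1   # NMF noise model init
    speech_pow = noisy_pow
    for _ in range(n_iters):
        speech_pow = spectral_estep(noisy_pow, W @ H)                 # E-step (stand-in)
        noise_res = np.maximum(noisy_pow - speech_pow, 1e-10)
        W, H = nmf_mstep(noise_res, W, H)                             # M-step: refit noise model
    return speech_pow

noisy_pow = np.abs(np.random.default_rng(1).standard_normal((257, 100))) ** 2
print(em_enhance(noisy_pow).shape)   # (257, 100) enhanced power spectrogram
```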
The experiments are robust, utilizing well-known datasets (WSJ0-QUT and VoiceBank-DEMAND) to evaluate the proposed methods against several baselines, including both unsupervised and supervised approaches. The results indicate clear improvements in performance metrics such as SI-SDR, PESQ, and ESTOI, particularly in matched conditions, which supports the effectiveness of the proposed methods. The statistical significance of the results is also addressed, adding credibility to the findings.
The paper mentions that the code will be made publicly available, which is a positive aspect for reproducibility. In addition, implementation specifics such as hyperparameters and training configurations are adequately described, allowing for replication of the experiments. The use of standard datasets also aids in reproducibility.
While the paper presents significant advancements, it does not explore the limitations of the proposed methods in depth. For instance, the performance under mismatched conditions shows some degradation, suggesting that further robustness improvements are needed. Additionally, the computational complexity of the methods may limit their applicability in real-time scenarios.
The proposed frameworks have the potential to significantly enhance applications in speech processing, particularly in scenarios where clean speech data is scarce. This could benefit various fields, including telecommunications, hearing aids, and voice recognition technologies, where improved speech clarity is crucial. The focus on unsupervised methods also opens avenues for broader applicability in diverse environments without the need for extensive labeled datasets.
Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/. The code and model will be made publicly available after the paper has been accepted.
Primary: Huawei
All Institutions: Huawei, DiT
The DSA-Tokenizer presents a novel approach to speech tokenization by achieving effective semantic-acoustic disentanglement, which significantly enhances the controllability and quality of speech generation in large language models. This work not only advances the state-of-the-art in speech processing but also opens avenues for future research in audio modeling and ethical considerations surrounding voice synthesis technologies.
The DSA-Tokenizer introduces a dual-stream architecture that effectively disentangles semantic and acoustic tokens through distinct optimization constraints. This methodology is innovative as it combines ASR supervision for semantic tokens and mel-spectrogram restoration for acoustic tokens, addressing the limitations of existing tokenizers that either mix these attributes or fail to achieve complete disentanglement. The introduction of a hierarchical Flow-Matching decoder and a joint reconstruction-recombination training strategy further enhances the model's capabilities, allowing for flexible recombination of speech attributes.
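A minimal sketch of the kind of two-stream objective described above is given below, assuming a CTC-style ASR loss on the semantic stream and an L1 mel-reconstruction loss on the acoustic stream; the loss weights, loss choices, and tensor shapes are assumptions for illustration, not DSA-Tokenizer's actual training code.

```python
import torch
import torch.nn.functional as F

def disentangling_loss(semantic_logits, text_targets, text_lengths, frame_lengths,
                       mel_pred, mel_target, w_asr=1.0, w_mel=1.0):
    """Illustrative composite objective: ASR supervision on semantic tokens,
    mel-spectrogram reconstruction on acoustic tokens."""
    # ASR-style supervision on the semantic stream (CTC over frame-level logits).
    log_probs = F.log_softmax(semantic_logits, dim=-1).transpose(0, 1)  # (T, B, V)
    asr_loss = F.ctc_loss(log_probs, text_targets, frame_lengths, text_lengths,
                          blank=0, zero_infinity=True)
    # Reconstruction supervision on the acoustic stream.
    mel_loss = F.l1_loss(mel_pred, mel_target)
    return w_asr * asr_loss + w_mel * mel_loss

if __name__ == "__main__":
    B, T, V, M = 2, 50, 100, 80
    loss = disentangling_loss(
        torch.randn(B, T, V),                              # semantic-stream logits
        torch.randint(1, V, (B, 12)),                      # text token targets
        torch.full((B,), 12, dtype=torch.long),            # target lengths
        torch.full((B,), T, dtype=torch.long),             # frame lengths
        torch.randn(B, M, T), torch.randn(B, M, T),        # predicted / target mels
    )
    print(loss.item())
```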
The experiments conducted are comprehensive, utilizing a variety of datasets and evaluation metrics to assess both reconstruction fidelity and cross-utterance recombination capabilities. The results demonstrate that DSA-Tokenizer outperforms existing models in key metrics such as UTMOS scores, WER, and speaker similarity, indicating its effectiveness in achieving high-quality speech generation while maintaining semantic integrity.
The paper provides detailed training procedures, model architectures, and evaluation setups, which enhance reproducibility. However, the actual code and model are not yet publicly available, which could hinder independent verification of results until they are released.
The paper acknowledges limitations such as inference latency due to the deep architecture of the decoder, which may not be suitable for real-time applications. Additionally, the model's applicability to other audio modalities beyond speech has not been explored, suggesting a narrow focus that could limit broader applicability.
The DSA-Tokenizer has significant implications for the development of controllable speech generation systems, particularly in applications like voice cloning and speech synthesis. However, the potential for misuse in generating deepfakes raises ethical concerns that necessitate the development of detection and mitigation strategies.
Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, which we use to train a sequence-to-sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state-of-the-art results, significantly outperforming both fully supervised methods and previous synthetic-data approaches. The code for reproducing our experiments is publicly available at https://github.com/pier-maker92/ADT_STR
Primary: Roma Tre University
All Institutions: Roma Tre University, Sapienza University of Rome
This paper introduces a novel semi-supervised method for generating high-quality synthetic data for Automatic Drum Transcription, significantly advancing the state-of-the-art in the field. The comprehensive methodology and rigorous experimental validation highlight its potential impact on music information retrieval and related applications.
The paper proposes a semi-supervised approach to curate a large corpus of one-shot drum samples from unlabeled audio sources, addressing the limitations of existing synthetic data methods. The methodology is well-structured, utilizing CLAP embeddings for effective sample classification and ensuring high diversity in the generated data. The systematic approach to building a standardized instrument vocabulary and the detailed description of the data generation pipeline are commendable, demonstrating a clear understanding of the challenges in Automatic Drum Transcription (ADT).
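To make the CLAP-based curation step concrete, the sketch below classifies a one-shot sample zero-shot by cosine similarity between its audio embedding and text-prompt embeddings of candidate drum classes; `embed_audio` and `embed_text` are hypothetical stand-ins for the pretrained joint audio-text encoders, and the prompts are illustrative.

```python
import numpy as np

# Hypothetical stand-ins for a CLAP-style joint audio-text embedding model;
# in practice these would call the pretrained encoders used by the authors.
def embed_audio(waveform: np.ndarray) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2**32))
    return rng.standard_normal(512)

def embed_text(prompt: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

def classify_one_shot(waveform, class_prompts):
    """Assign a drum one-shot to the class whose text prompt is closest in the
    shared embedding space (zero-shot classification via cosine similarity)."""
    a = embed_audio(waveform)
    a = a / np.linalg.norm(a)
    scores = {}
    for label, prompt in class_prompts.items():
        t = embed_text(prompt)
        scores[label] = float(a @ (t / np.linalg.norm(t)))
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    prompts = {"kick": "a kick drum one-shot", "snare": "a snare drum hit",
               "hihat": "a closed hi-hat sample"}
    label, scores = classify_one_shot(np.random.default_rng(0).standard_normal(22050), prompts)
    print(label, scores)
```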
The experimental evaluation is robust, employing well-known datasets (ENST and MDB) to benchmark the proposed model. The results indicate a significant performance improvement over previous methods, validating the effectiveness of the proposed approach. The systematic comparison across different experimental settings provides strong evidence for the contributions of each component of the framework, enhancing the credibility of the findings.
The authors have made their code publicly available, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed documentation regarding the experimental setup, hyperparameters, and specific configurations used in training, which would further facilitate replication of the results.
While the paper presents a novel approach, it does not thoroughly address potential limitations such as the reliance on the quality of the one-shot libraries and the inherent challenges of generalizing from synthetic data to real-world applications. Additionally, the impact of the chosen semi-supervised method on the overall performance could be explored further.
The proposed method has significant implications for the field of music information retrieval, particularly in enhancing the accessibility and quality of drum transcription systems. By reducing the dependency on paired datasets, this work could democratize access to high-quality transcription tools for musicians, educators, and researchers, fostering innovation in music technology.
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
Primary: unknown
All Institutions: unknown
The paper presents a novel framework, Speech-Hands, designed to enhance audio processing by incorporating self-reflection mechanisms, significantly improving performance in speech recognition and audio reasoning tasks.
The methodology introduces a novel agentic framework, Speech-Hands, which incorporates self-reflection into audio processing tasks. The use of action tokens to encode this reflection decision lets the model explicitly choose between trusting its own recognition output and consulting external audio perception, which helps prevent it from being derailed by flawed external candidates, and the same action mechanism extends naturally from speech recognition to multiple-choice audio reasoning.
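A toy version of such a reflection decision is sketched below: the model keeps its own hypothesis unless an external candidate is clearly more confident. In Speech-Hands this choice is emitted as a learned action token rather than a hand-set rule, so the margin and confidence scores here are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # e.g., mean token probability mapped to [0, 1]

def reflect_and_decide(own: Hypothesis, external: Hypothesis, margin: float = 0.1) -> str:
    """Illustrative self-reflection rule: trust the model's own hypothesis unless
    the external candidate is clearly more confident."""
    if external.confidence > own.confidence + margin:
        return external.text   # consult external audio perception
    return own.text            # trust the model's own output

if __name__ == "__main__":
    print(reflect_and_decide(Hypothesis("turn left at the light", 0.82),
                             Hypothesis("turn left at the night", 0.64)))
```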
The experimental evaluation is thorough, utilizing multiple datasets across speech recognition and audio question answering tasks. The results demonstrate a significant improvement in performance metrics, such as WER and accuracy, compared to strong baselines. The paper provides a comprehensive analysis of the results, including confusion matrices and case studies that illustrate the model's decision-making process. However, the experiments could benefit from a broader range of external models to validate the framework's effectiveness further.
The paper lacks detailed implementation specifics, such as code availability or data access, which could hinder reproducibility. While the methodology is described in detail, the absence of a public repository or demo limits the ability of other researchers to replicate the results.
The main limitations include the imbalance in action token training, particularly the underrepresentation of one of the action classes, which may bias the model's self-reflection decisions.
The proposed framework has significant implications for improving audio intelligence systems, particularly in applications requiring reliable speech recognition and audio reasoning. By enabling models to make informed decisions about when to trust their own outputs versus external inputs, this work could enhance the robustness and reliability of AI systems in real-world audio processing tasks. The approach could be extended to various domains, including assistive technologies, automated transcription services, and interactive voice agents.
Tabla Stroke Transcription (TST) is central to the analysis of rhythmic structure in Hindustani classical music, yet remains challenging due to complex rhythmic organization and the scarcity of strongly annotated data. Existing approaches largely rely on fully supervised learning with onset-level annotations, which are costly and impractical at scale. This work addresses TST in a weakly supervised setting, using only symbolic stroke sequences without temporal alignment. We propose a framework that combines a CTC-based acoustic model with sequence-level rhythmic rescoring. The acoustic model produces a decoding lattice, which is refined using a Tāla-Independent Static-Dynamic Rhythmic Model (TI-SDRM) that integrates long-term rhythmic structure with short-term adaptive dynamics through an adaptive interpolation mechanism. We curate a new real-world tabla solo dataset and a complementary synthetic dataset, establishing the first benchmark for weakly supervised TST in Hindustani classical music. Experiments demonstrate consistent and substantial reductions in stroke error rate over acoustic-only decoding, confirming the importance of explicit rhythmic structure for accurate transcription.
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
This paper presents a pioneering approach to weakly supervised tabla stroke transcription, establishing a new benchmark and demonstrating the critical role of rhythmic structure in improving transcription accuracy. The innovative combination of acoustic modeling and rhythmic rescoring offers valuable insights into the complexities of music transcription, particularly for non-Western musical traditions.
The proposed methodology introduces a novel framework for weakly supervised Tabla Stroke Transcription (TST) that effectively combines a Connectionist Temporal Classification (CTC)-based acoustic model with a rhythm-aware lattice rescoring mechanism. The introduction of the Tāla-Independent Static-Dynamic Rhythmic Model (TI-SDRM) is a significant advancement, as it integrates long-term rhythmic structure with short-term adaptive dynamics. The adaptive interpolation mechanism used to balance the contributions of static and dynamic models based on acoustic confidence is particularly innovative. However, the methodology could benefit from further clarification on the implementation details of the adaptive interpolation process and how it is tuned in practice.
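One plausible form of such confidence-dependent interpolation, offered purely as an illustration and not as the paper's actual formulation, is a log-linear blend whose mixing weight is a sigmoid of the acoustic confidence:

```python
import math

def interpolated_log_score(static_logp: float, dynamic_logp: float,
                           acoustic_conf: float, k: float = 8.0, c0: float = 0.5) -> float:
    """Blend long-term (static) and short-term (dynamic) rhythmic model scores.

    A sigmoid of the acoustic confidence sets the mixing weight: when the
    acoustic model is confident, more weight goes to the adaptive dynamic model;
    when it is uncertain, the global static structure dominates. The sigmoid
    parameterization is an assumption for illustration only.
    """
    lam = 1.0 / (1.0 + math.exp(-k * (acoustic_conf - c0)))  # weight on the dynamic model
    # Log-linear interpolation of the two sequence-model probabilities.
    return lam * dynamic_logp + (1.0 - lam) * static_logp

if __name__ == "__main__":
    # Rescoring a lattice arc for one candidate stroke given its rhythmic context.
    print(interpolated_log_score(static_logp=-2.1, dynamic_logp=-1.4, acoustic_conf=0.8))
```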
The experimental setup is robust, utilizing both a newly curated real-world tabla dataset and a synthetic dataset to establish a comprehensive benchmark for weakly supervised TST. The results demonstrate substantial improvements in stroke error rates when incorporating the TI-SDRM framework compared to acoustic-only decoding. The ablation studies effectively highlight the contributions of various components within the framework, reinforcing the importance of both long-term rhythmic structure and local adaptation. However, the paper could enhance its experimental rigor by providing more detailed statistical analysis of the results and discussing the significance of the improvements observed.
The paper provides a thorough description of the methodology, including the acoustic model architecture and training configurations. However, the lack of publicly available code or datasets limits reproducibility. Providing access to the datasets and code would significantly enhance the ability of other researchers to replicate the results and build upon this work.
One limitation of the study is the reliance on weakly supervised learning, which, while innovative, may still lead to inaccuracies in transcription due to the absence of precise onset-level annotations. Additionally, the performance of the proposed method may vary across different tabla styles or individual performers, which is not fully explored in the current work. The paper also does not address how the model might generalize to other percussion instruments or musical traditions, which could limit its broader applicability.
The research has the potential to significantly impact the field of music information retrieval, particularly in the context of Indian classical music. By enabling more accessible transcription of tabla performances, it could facilitate music education, archiving, and the preservation of cultural heritage. Furthermore, the methodologies developed could be adapted for use in other musical contexts and instruments, broadening the scope of automatic transcription systems.
Speaker-specific anti-spoofing and synthesis-source tracing are central challenges in audio anti-spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ-Spoof, a speaker-specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing. The corpus spans a single speaker (including studio-quality recordings), 30 TTS families, 500 generatively variant subsets, 10 bona fide neural-processing variants, and more than 3 million utterances. This variation-dense design enables robust speaker-conditioned anti-spoofing and fine-grained synthesis-source tracing. We further position this dataset as both a practical reference training resource and a benchmark evaluation suite for anti-spoofing and source tracing.
Primary: University of Michigan
All Institutions: University of Michigan
The paper introduces the LJ-Spoof dataset, a comprehensive resource for audio anti-spoofing research, systematically varying multiple parameters to enhance the robustness of speaker-specific detection systems. This work significantly contributes to the field by addressing the lack of diverse datasets and providing a structured methodology for future research.
The methodology presented in the paper is robust and comprehensive, focusing on the creation of a diverse corpus that systematically varies multiple parameters relevant to audio anti-spoofing. The authors clearly outline the dataset generation process, categorizing variations by training regimes, input sources, generative parameters, and neural post-processing techniques. This structured approach allows for a nuanced analysis of the effects of these variations on speaker-specific anti-spoofing systems. The inclusion of a reproducible protocol enhances the methodological rigor, ensuring that other researchers can replicate the dataset generation process.
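To visualize how such a variation-dense protocol might be organized, the sketch below defines an illustrative per-subset descriptor whose fields mirror the variation axes listed in the abstract; the schema and concrete values are assumptions, not the released protocol file.

```python
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class SpoofSubsetSpec:
    """Illustrative descriptor for one generatively variant subset of the corpus."""
    tts_family: str
    vocoder: str
    prompt_source: str            # e.g., studio recording vs. other bona fide prompts
    training_regime: str          # e.g., fine-tuned vs. zero-shot
    sampling_params: dict = field(default_factory=dict)  # generative hyperparameters varied
    post_processing: Optional[str] = None                # optional neural post-processing

if __name__ == "__main__":
    spec = SpoofSubsetSpec(
        tts_family="diffusion_tts_a",
        vocoder="neural_vocoder_x",
        prompt_source="studio",
        training_regime="fine_tuned",
        sampling_params={"guidance_scale": 3.0, "steps": 50},
        post_processing="bandwidth_extension",
    )
    print(asdict(spec))
```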
The paper does not present experimental results directly within the text, as it primarily focuses on the dataset creation. However, the authors emphasize the potential for the LJ-Spoof dataset to facilitate future experiments and evaluations of anti-spoofing systems. The dataset's scale, with over 3 million utterances and 500 generatively variant subsets, suggests that it can support extensive testing and validation of various models. The authors also propose a strategic split protocol for training and evaluating anti-spoofing systems, which is a crucial aspect for ensuring the dataset's effectiveness.
The authors provide a detailed versioned protocol file that documents the architectural choices, configuration settings, and neural post-processing steps used in generating the dataset. This level of detail is commendable and significantly enhances the reproducibility of the research. By making the dataset open-sourced, the authors further promote transparency and community engagement, allowing other researchers to build upon their work.
One limitation noted is the focus on a single speaker, which may restrict the generalizability of findings to multi-speaker scenarios. While the authors acknowledge this and express intentions to extend the corpus to include multiple speakers in the future, the current dataset may not fully address the complexities involved in real-world applications where multiple voices and contexts are present. Additionally, the reliance on existing TTS systems and their configurations may introduce biases based on the limitations of those systems.
The LJ-Spoof dataset has significant implications for the fields of audio processing, anti-spoofing, and synthetic speech detection. By providing a comprehensive resource for evaluating speaker-specific anti-spoofing systems, the dataset can help advance research aimed at mitigating the risks associated with voice cloning and synthetic speech technologies. This is particularly relevant in an era where deepfake technologies pose threats to personal security and misinformation. The potential for broader applications in security, telecommunications, and digital forensics is substantial.
This paper summarizes the ICASSP 2026 Automatic Song Aesthetics Evaluation (ASAE) Challenge, which focuses on predicting the subjective aesthetic scores of AI-generated songs. The challenge consists of two tracks: Track 1 targets the prediction of the overall musicality score, while Track 2 focuses on predicting five fine-grained aesthetic scores. The challenge attracted strong interest from the research community and received numerous submissions from both academia and industry. Top-performing systems significantly surpassed the official baseline, demonstrating substantial progress in aligning objective metrics with human aesthetic preferences. The outcomes establish a standardized benchmark and advance human-aligned evaluation methodologies for modern music generation systems.
Primary: Nanjing University
All Institutions: Nanjing University, Nanyang Technological University, Shanghai Conservatory of Music
The paper effectively establishes a benchmark for evaluating the aesthetics of AI-generated music, advancing methodologies in the field and fostering collaboration among researchers. The challenge's design and results highlight the importance of aligning objective metrics with human preferences, paving the way for future advancements in music generation technologies.
The paper presents a well-structured challenge that addresses a significant gap in the evaluation of AI-generated music aesthetics. The use of two distinct tracks for overall musicality and fine-grained aesthetic dimensions is innovative, allowing for comprehensive evaluation. The hierarchical evaluation metric proposed is a strong methodological contribution, as it combines multiple correlation metrics to assess model performance effectively. The challenge's design encourages participants to explore diverse modeling strategies, which is crucial for advancing the field.
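As a rough idea of how several correlation measures can be folded into one score, the sketch below averages Pearson, Spearman, and Kendall correlations between predicted and human aesthetic scores; the official ASAE hierarchical metric is more structured than this plain average, which is only an illustrative stand-in.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def combined_correlation(pred, target):
    """Combine several linear/rank correlation measures into one score."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    metrics = {
        "pearson": pearsonr(pred, target)[0],
        "spearman": spearmanr(pred, target)[0],
        "kendall": kendalltau(pred, target)[0],
    }
    metrics["combined"] = float(np.mean(list(metrics.values())))
    return metrics

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.uniform(1, 5, 50)             # human aesthetic scores
    preds = truth + rng.normal(0, 0.5, 50)    # noisy system predictions
    print(combined_correlation(preds, truth))
```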
The experimental setup is robust, with a clear delineation of training and test datasets, including both seen and unseen systems. The performance metrics used provide a thorough assessment of model capabilities. The results indicate that top-performing systems have made significant strides in aligning with human aesthetic preferences, showcasing the effectiveness of various modeling techniques. However, the paper could benefit from more detailed statistical analysis of the results to better understand the performance variations among different models.
The paper provides adequate information regarding the datasets and evaluation metrics, which is essential for reproducibility. However, it lacks detailed descriptions of the individual models used by the participants, which could hinder full reproducibility of results. Including links to the submitted models or detailed descriptions of their architectures would enhance reproducibility.
One notable limitation is the modest TTA scores across teams, suggesting that while models can rank songs well, they struggle with fine-grained discrimination, particularly at the high-quality end. Additionally, the challenge's reliance on subjective aesthetic scores may introduce variability based on individual preferences, which could affect the generalizability of the findings.
The challenge has the potential to significantly influence the field of music generation by providing a standardized benchmark for evaluating AI-generated music. It encourages collaboration between academia and industry, fostering innovation in music technology. The outcomes could lead to improved music generation systems that better align with human aesthetic preferences, impacting both the creative industries and consumer experiences.
In this work, we present a novel perspective on cognitive impairment classification from speech by integrating speech foundation models that explicitly recognize speech dialects. Our motivation is based on the observation that individuals with Alzheimer's Disease (AD) or mild cognitive impairment (MCI) often produce measurable speech characteristics, such as slower articulation rate and lengthened sounds, in a manner similar to dialectal phonetic variations seen in speech. Building on this idea, we introduce VoxCog, an end-to-end framework that uses pre-trained dialect models to detect AD or MCI without relying on additional modalities such as text or images. Through experiments on multiple multilingual datasets for AD and MCI detection, we demonstrate that model initialization with a dialect classifier on top of speech foundation models consistently improves the predictive performance of AD or MCI. Our trained models yield similar or often better performance compared to previous approaches that ensembled several computational methods using different signal modalities. Particularly, our end-to-end speech-based model achieves 87.5% and 85.9% accuracy on the ADReSS 2020 challenge and ADReSSo 2021 challenge test sets, outperforming existing solutions that use multimodal ensemble-based computation or LLMs.
Primary: University of Southern California
All Institutions: University of Southern California
The paper presents VoxCog, a novel framework for cognitive impairment classification that utilizes dialectal knowledge from speech. This innovative approach not only enhances predictive performance but also simplifies the modeling process by eliminating the need for additional modalities, thus making significant strides in the field of machine learning for healthcare applications.
The paper introduces VoxCog, an end-to-end framework that leverages pre-trained dialect models to classify cognitive impairments based solely on speech. The methodology is innovative as it integrates dialectal knowledge, which is typically orthogonal to cognitive impairment detection, thus providing a fresh perspective on the task. The use of large-scale dialect models as a prior for cognitive impairment classification is a significant methodological advancement that could lead to more efficient and interpretable models. The approach is well-structured, with clear explanations of model initialization, architecture, and the rationale behind the choice of dialectal features.
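The general transfer recipe described above can be illustrated with a short sketch: load an encoder initialized from a dialect-classification checkpoint and attach a fresh AD/MCI head. The encoder architecture, checkpoint path, and head shape below are assumptions for illustration, not VoxCog's actual components.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stand-in for a speech foundation model encoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, dim, 10, stride=5), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1))
        self.dim = dim

    def forward(self, wav):                       # wav: (B, 1, T)
        return self.net(wav).squeeze(-1)          # (B, dim)

def build_cognitive_classifier(dialect_ckpt_path=None, n_classes=2):
    """Initialize from a dialect-classification checkpoint, then swap in a new head.

    The key idea carried over from the paper is that dialect-aware initialization
    of the encoder benefits AD/MCI classification; the checkpoint path is hypothetical.
    """
    encoder = SpeechEncoder()
    if dialect_ckpt_path is not None:
        encoder.load_state_dict(torch.load(dialect_ckpt_path), strict=False)
    head = nn.Linear(encoder.dim, n_classes)      # new AD/MCI head, trained from scratch
    return nn.Sequential(encoder, head)

if __name__ == "__main__":
    model = build_cognitive_classifier()
    logits = model(torch.randn(4, 1, 16000))
    print(logits.shape)                           # torch.Size([4, 2])
```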
The experiments are extensive, covering multiple datasets across different languages, which demonstrates the robustness and generalizability of the proposed model. The results show that VoxCog consistently outperforms baseline models, including those that rely on multimodal approaches, indicating strong empirical support for the hypothesis that dialectal features enhance cognitive impairment classification. The choice of datasets is appropriate, and the evaluation metrics are well-aligned with the goals of the study.
The paper provides sufficient detail regarding the model architecture, training procedures, and data preparation methods, which facilitates reproducibility. However, the absence of a public code repository or demo URL limits the ability of other researchers to replicate the findings directly. Including such resources would significantly enhance the paper's reproducibility.
One notable limitation is the reliance on dialectal features, which may not capture all relevant aspects of cognitive impairment in speech. Additionally, the paper acknowledges potential confounds introduced by the presence of interviewers' speech during recordings, which could affect the model's predictions. Future work is needed to address these issues, particularly through improved speaker diarization techniques.
The implications of this research are significant, particularly in the context of aging populations and the increasing prevalence of cognitive impairments like Alzheimer's disease. By providing a more efficient and interpretable model for cognitive impairment detection, this work could facilitate earlier diagnosis and intervention, ultimately improving patient outcomes. The approach also opens avenues for further research into the intersection of dialectology and cognitive health, potentially influencing both fields.
Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model's internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
Primary: unknown
All Institutions: unknown
This paper presents a significant advancement in understanding and mitigating noise interference in Large Audio Language Models through the introduction of a novel metric and a corresponding denoising strategy. The rigorous methodology and experimental validation position it as a valuable contribution to the field of machine learning and audio processing.
The methodology introduces Signal Embedding Energy (SEE) as a novel metric for quantifying noise interference in Large Audio Language Models (LALMs). The approach leverages structured activation subspaces derived from the model's internal representations, allowing for a more nuanced understanding of how noise affects model performance. The proposed method of Signal Embedding Energy Neutralization (SEEN) aims to mitigate noise without retraining, which is a significant advancement in the field. The use of empirical evaluations and the correlation of SEE with generation quality (0.98) further validate the robustness of the methodology.
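A back-of-the-envelope version of this idea, under the assumption that the structured activation subspace can be approximated by the top singular directions of pure-noise embeddings, is sketched below: an SEE-like score measures the embedding's energy inside that subspace, and a SEEN-like step projects that component out. This illustrates the concept only and is not the authors' implementation.

```python
import numpy as np

def noise_subspace(noise_embeddings: np.ndarray, rank: int = 8) -> np.ndarray:
    """Top singular directions of pure-noise embeddings (rows = samples)."""
    centered = noise_embeddings - noise_embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]                                  # (rank, dim) orthonormal basis

def see_score(embedding: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of the embedding's energy lying in the noise subspace (SEE-like)."""
    coeffs = basis @ embedding
    return float(np.sum(coeffs**2) / (np.sum(embedding**2) + 1e-12))

def neutralize(embedding: np.ndarray, basis: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Remove (part of) the noise-subspace component, a SEEN-like mitigation.

    strength=1.0 projects the component out entirely; the paper cautions that
    overly aggressive neutralization can remove task-relevant information.
    """
    return embedding - strength * basis.T @ (basis @ embedding)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    basis = noise_subspace(rng.standard_normal((64, 128)))
    x = rng.standard_normal(128)
    print(see_score(x, basis), see_score(neutralize(x, basis), basis))
```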
The experiments are comprehensive, utilizing multiple models (Qwen, MiniCPM, StepAudio) and datasets (MMAU, Librispeech) to assess the effectiveness of SEE and SEEN across various noise conditions. The results demonstrate a consistent correlation between SEE and model performance, with SEEN outperforming traditional denoising methods. However, the paper could benefit from more extensive comparisons with a wider array of baseline methods and additional noise types to strengthen the findings.
The paper provides sufficient details on the experimental setup, including model architectures, noise synthesis, and evaluation metrics, which supports reproducibility. The availability of the code on GitHub enhances this aspect, allowing other researchers to replicate the experiments and validate the findings.
The limitations include the assumption of access to aligned clean requests and pure-noise recordings, which may not be feasible in all real-world scenarios. Additionally, the methodology relies on mean pooling, which may overlook temporal nuances in audio signals. The authors acknowledge that overly aggressive neutralization could remove task-relevant information, indicating a need for careful calibration.
The proposed methods have significant implications for real-world applications of LALMs, particularly in environments with varying noise conditions. By providing a quantitative measure of noise interference and a practical mitigation strategy, this work could enhance the robustness of audio-based AI systems in critical applications such as in-car assistants and online communication tools.
We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that not only supports fine-grained timestamp prediction but also acts as a synchronization signal between semantic understanding and speaker tracking. Compared to previous works that primarily focus on speaker-attributed ASR or implicit diarization, TagSpeech addresses the challenge of fine-grained speaker-content alignment and explicitly models "who spoke what and when" in an end-to-end manner. Experiments on AMI and AliMeeting benchmarks demonstrate that our method achieves consistent improvements in Diarization Error Rate (DER) over strong end-to-end baselines, including Qwen-Omni and Gemini, particularly in handling complex speech overlaps. Moreover, TagSpeech employs a parameter-efficient training paradigm in which the LLM backbone is frozen and only lightweight projectors are trained, resulting in strong performance with low computational cost.
Primary: University of Illinois Urbana-Champaign
All Institutions: Johns Hopkins University, University of Illinois Urbana-Champaign
The main contribution of this work is the introduction of TagSpeech, a unified framework for multi-speaker ASR and diarization that effectively addresses fine-grained speaker-content alignment through innovative methodologies. This paper makes a substantial technical contribution to the field of audio processing, particularly in enhancing the accuracy and efficiency of multi-speaker systems.
The methodology presented in TagSpeech is innovative, particularly in its use of Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The decoupled semantic and speaker streams, along with the Serialized Output Training (SOT), provide a structured approach to learning turn-taking dynamics. The interleaved time anchor mechanism is a significant contribution that enhances the system's ability to synchronize semantic understanding with speaker tracking, addressing a critical gap in existing methods. The end-to-end nature of the framework, which explicitly models speaker-content alignment, is a notable advancement over previous works that have largely focused on either ASR or implicit diarization.
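To make the interleaving concrete, the sketch below shows one hypothetical serialization with time anchors and speaker tags, together with a parser that recovers (speaker, start time, text) segments; the token format is an assumption and may differ from TagSpeech's actual inventory.

```python
import re

# Hypothetical serialization: interleaved time anchors and speaker tags, e.g.
#   "<t=0.00> <spk:A> good morning <t=1.42> <spk:B> hi there <t=2.10>"
# This only illustrates how anchors can synchronize the semantic and speaker streams.
PATTERN = re.compile(r"<t=(?P<t>[\d.]+)>\s*<spk:(?P<spk>\w+)>\s*(?P<text>[^<]*)")

def parse_serialized(output: str):
    """Recover (speaker, start_time, text) segments from a serialized string."""
    segments = []
    for m in PATTERN.finditer(output):
        text = m.group("text").strip()
        if text:
            segments.append((m.group("spk"), float(m.group("t")), text))
    return segments

if __name__ == "__main__":
    demo = "<t=0.00> <spk:A> good morning everyone <t=1.42> <spk:B> hi there <t=2.10>"
    for spk, start, text in parse_serialized(demo):
        print(f"{start:5.2f}s  {spk}: {text}")
```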
The experiments conducted on the AMI and AliMeeting benchmarks are robust, demonstrating consistent improvements in Diarization Error Rate (DER) compared to strong baselines like Qwen-Omni and Gemini. The focus on complex speech overlaps is particularly relevant, as this is a common challenge in multi-speaker environments. The results indicate that TagSpeech not only achieves better performance but does so with a parameter-efficient training paradigm, which is a critical consideration for practical applications.
The paper provides a GitHub repository link, which is essential for reproducibility. However, the paper should ideally include more detailed implementation instructions, hyperparameter settings, and any specific configurations used during training and evaluation to facilitate easier reproduction of results by other researchers.
While the framework shows promise, the paper does not sufficiently address potential limitations, such as the scalability of the approach to larger datasets or more diverse speaker profiles. Additionally, the reliance on a frozen LLM backbone may limit the adaptability of the model to new domains or languages without further fine-tuning.
The implications of TagSpeech are significant, particularly in applications requiring accurate transcription and speaker identification in multi-speaker scenarios, such as meetings, interviews, and media production. The advancements in handling speech overlaps could lead to better accessibility and usability of ASR technologies in real-world applications.
In this study, we present a multimodal framework for predicting neuro-facial disorders by capturing both vocal and facial cues. We hypothesize that explicitly disentangling shared and modality-specific representations within multimodal foundation model embeddings can enhance clinical interpretability and generalization. To validate this hypothesis, we propose DIVINE, a fully disentangled multimodal framework that operates on representations extracted from state-of-the-art (SOTA) audio and video foundation models, incorporating hierarchical variational bottlenecks, sparse gated fusion, and learnable symptom tokens. DIVINE operates in a multitask learning setup to jointly predict diagnostic categories (Healthy Control, ALS, Stroke) and severity levels (Mild, Moderate, Severe). The model is trained using synchronized audio and video inputs and evaluated on the Toronto NeuroFace dataset under full (audio-video) as well as single-modality (audio-only and video-only) test conditions. Our proposed approach, DIVINE, achieves SOTA results, with the DeepSeek-VL2 and TRILLsson combination reaching 98.26% accuracy and 97.51% F1-score. Under modality-constrained scenarios, the framework performs well, showing strong generalization when tested with video-only or audio-only inputs. It consistently yields superior performance compared to unimodal models and baseline fusion techniques. To the best of our knowledge, DIVINE is the first framework that combines cross-modal disentanglement, adaptive fusion, and multitask learning to comprehensively assess neurological disorders using synchronized speech and facial video.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DIVINE, a fully disentangled multimodal framework that enhances the assessment of neuro-facial disorders by effectively integrating audio and visual cues through advanced representation learning techniques. This work is significant as it addresses critical challenges in clinical assessments, providing a foundation for future research and practical applications in the field.
The proposed DIVINE framework introduces a novel approach to disentangling shared and modality-specific representations in a multimodal setting, leveraging advanced techniques such as hierarchical variational bottlenecks and sparse gated fusion. The integration of symptom tokens for clinical interpretability is a significant methodological contribution that enhances the model's applicability in real-world clinical scenarios. The multitask learning setup is well-justified, allowing for simultaneous classification and severity estimation, which is crucial for neuro-facial disorder assessment.
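A generic gated-fusion module of the kind referred to above is sketched below, with a per-dimension sigmoid gate over concatenated audio and video features and an L1 penalty that encourages sparse gating; the dimensions and penalty form are assumptions rather than DIVINE's exact design.

```python
import torch
import torch.nn as nn

class SparseGatedFusion(nn.Module):
    """Generic gated fusion of audio and video features.

    A learned gate decides, per dimension, how much each modality contributes;
    an L1 penalty on the gate encourages sparse, interpretable selections.
    """
    def __init__(self, dim_a=512, dim_v=512, dim_out=256):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_v = nn.Linear(dim_v, dim_out)
        self.gate = nn.Sequential(nn.Linear(dim_a + dim_v, dim_out), nn.Sigmoid())

    def forward(self, audio, video):
        g = self.gate(torch.cat([audio, video], dim=-1))     # (B, dim_out) in [0, 1]
        fused = g * self.proj_a(audio) + (1 - g) * self.proj_v(video)
        sparsity_penalty = g.abs().mean()                    # add to the training loss
        return fused, sparsity_penalty

if __name__ == "__main__":
    fusion = SparseGatedFusion()
    fused, penalty = fusion(torch.randn(4, 512), torch.randn(4, 512))
    print(fused.shape, float(penalty))
```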
The experiments are robust, utilizing the Toronto NeuroFace dataset with a well-defined evaluation strategy that includes cross-validation and various test conditions (full, audio-only, video-only). The reported results demonstrate significant improvements over baseline methods and unimodal models, establishing a new state-of-the-art performance. However, the paper could benefit from more detailed comparisons with existing methods in terms of computational efficiency and model complexity.
The paper provides a GitHub repository with code, model checkpoints, and evaluation scripts, which is a positive aspect for reproducibility. However, the absence of detailed hyperparameter settings and specific training configurations in the main text may hinder full reproducibility for some readers.
The study's limitations include a reliance on a single dataset, which may affect the generalizability of the findings. The authors also acknowledge the need for cross-dataset validation, which is essential for establishing the robustness of the model in diverse clinical settings. Additionally, the ethical implications of using the model in clinical practice are briefly mentioned but could be explored in greater depth.
The DIVINE framework has the potential to significantly impact clinical workflows by providing a more objective and interpretable assessment tool for neuro-facial disorders. Its use in decision-support systems could enhance the accuracy and efficiency of clinical evaluations, but careful consideration of ethical deployment and clinician oversight is necessary to mitigate risks associated with automated decision-making.
We propose a unified framework for attributing synthetic speech to its source and for detecting speech generated by synthesizers that were not encountered during training. This requires methods that move beyond simple detection to support both detailed forensic analysis and open-set generalization. To address this, we introduce SIGNAL, a hybrid framework that combines speech foundation models (SFMs) with graph-based modeling and open-set-aware inference. Our framework integrates Graph Neural Networks (GNNs) and a k-Nearest Neighbor (KNN) classifier, allowing it to capture meaningful relationships between utterances and recognize speech that doesn't belong to any known generator. It constructs a query-conditioned graph over generator class prototypes, enabling the GNN to reason over relationships among candidate generators, while the KNN branch supports open-set detection via confidence-based thresholding. We evaluate SIGNAL using the DiffSSD dataset, which offers a diverse mix of real speech and synthetic audio from both open-source and commercial diffusion-based TTS systems. To further assess generalization, we also test on the SingFake benchmark. Our results show that SIGNAL consistently improves performance across both tasks, with Mamba-based embeddings delivering especially strong results. To the best of our knowledge, this is the first study to unify graph-based learning and open-set detection for tracing synthetic speech back to its origin.
Primary: unknown
All Institutions: unknown
The paper presents SIGNAL, a hybrid framework that effectively combines GNNs and KNN for robust synthetic speech detection and source attribution. This innovative approach has the potential to significantly advance the field of audio forensics by improving the detection of synthetic speech and addressing the challenges posed by unseen generators.
The proposed SIGNAL framework integrates Graph Neural Networks (GNNs) and k-Nearest Neighbors (KNN) to tackle the dual challenges of source attribution and open-set detection in synthetic speech. The methodology is well-structured, leveraging GNNs for relational modeling and KNN for instance-based reasoning. The construction of a query-conditioned graph over generator class prototypes is a novel approach that enhances the model's ability to reason over relationships among candidate generators. The combination of these two methodologies is innovative, although the paper could benefit from a more detailed explanation of the underlying assumptions and potential biases in the graph construction process.
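The open-set branch can be pictured with a small sketch: compare a query embedding to per-generator prototypes and return "unknown" when even the best match falls below a confidence threshold. The cosine metric and threshold value below are assumptions standing in for the paper's KNN-based rule.

```python
import numpy as np

def open_set_attribution(query: np.ndarray, prototypes: dict, threshold: float = 0.6):
    """Attribute a query embedding to a known generator or mark it 'unknown'.

    Cosine similarity to per-generator prototypes plays the role of the KNN
    confidence; if no prototype is similar enough, the utterance is treated as
    coming from an unseen generator.
    """
    q = query / (np.linalg.norm(query) + 1e-12)
    best_label, best_sim = "unknown", -1.0
    for label, proto in prototypes.items():
        sim = float(q @ (proto / (np.linalg.norm(proto) + 1e-12)))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return (best_label if best_sim >= threshold else "unknown"), best_sim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = {f"gen_{i}": rng.standard_normal(128) for i in range(5)}
    query = protos["gen_2"] + 0.2 * rng.standard_normal(128)
    print(open_set_attribution(query, protos))
```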
The experiments are comprehensive, utilizing two diverse datasets (DiffSSD and SingFake) to evaluate the framework's performance in both closed-set and open-set scenarios. The results demonstrate consistent improvements across various tasks, particularly with Mamba-based embeddings. However, the paper could enhance its credibility by providing more detailed statistical analyses and comparisons with existing state-of-the-art methods.
The paper provides a GitHub repository link for the SIGNAL framework, which is a positive aspect for reproducibility. However, the implementation details, such as hyperparameter settings and training procedures, could be more explicitly outlined to facilitate replication by other researchers.
The authors acknowledge several limitations, including the reliance on a fixed decision threshold for open-set detection and the lack of fine-grained attribution among multiple unseen sources. Additionally, the evaluation is limited to two benchmarks, which may not encompass all possible synthesis conditions or generators.
The proposed framework addresses critical issues in synthetic speech detection, which has significant implications for security, misinformation, and trust in AI systems. By improving the ability to trace synthetic speech back to its source, this work contributes to the broader field of digital forensics and responsible AI deployment.