Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of Speech-DRAME, a comprehensive framework for evaluating speech role-play that combines dual evaluation strategies and human-annotated datasets, significantly improving the assessment of generative models in speech interactions. This work addresses critical gaps in existing evaluation methodologies, paving the way for more effective and realistic speech-based AI systems.
The methodology presented in the paper is robust and innovative, introducing a dual evaluation paradigm that combines archetype and realism assessments for speech role-play. The creation of Speech-DRAME-EvalBench and DRAME-Eval demonstrates a thoughtful approach to addressing the limitations of existing evaluation models by incorporating human-annotated data and fine-tuning techniques. The authors effectively leverage a comprehensive dataset that includes both synthetic and real human speech, which enhances the evaluation's relevance and applicability. The clear definitions and structured evaluation strategies are commendable, making the framework both systematic and reproducible.
The experimental evaluation is thorough, with a well-defined setup that includes zero-shot, few-shot, and fine-tuning conditions. The results demonstrate significant improvements in correlation with human ratings when using the DRAME-Eval model compared to existing ALLMs. The paper provides detailed statistical analyses and comparisons across various models, showcasing the effectiveness of the proposed methods. However, the reliance on proprietary models for some evaluations may limit the generalizability of the findings.
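For readers who want to reproduce the agreement analysis, the metric itself is straightforward; a minimal sketch is shown below (the score arrays are hypothetical placeholders, not the paper's data).

```python
# Minimal sketch: Pearson correlation between an automatic judge's scores and
# human ratings, the agreement metric quoted for DRAME-Eval. The arrays below
# are hypothetical placeholders, not data released with the paper.
from scipy.stats import pearsonr

human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]   # e.g., 1-5 ratings from annotators
judge_scores  = [3.8, 2.0, 3.4, 4.7, 1.8, 4.1]   # scores produced by the judge model

r, p_value = pearsonr(human_ratings, judge_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```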
The authors emphasize reproducibility by providing access to datasets, models, and code through a GitHub repository. The detailed descriptions of the experimental setup, including data collection, annotation protocols, and evaluation metrics, enhance the likelihood that other researchers can replicate the study. The commitment to transparency is evident, although the use of proprietary models may pose challenges for full reproducibility in certain aspects.
One notable limitation is the potential bias introduced by the reliance on proprietary models for evaluation, which may not be accessible to all researchers. Additionally, while the dual evaluation framework is innovative, the complexity of the evaluation process may pose challenges for broader adoption in the community. The paper also acknowledges the gap between synthetic and real human performance, indicating that further work is needed to bridge this divide.
The proposed framework has significant implications for the development and evaluation of speech-based generative models, particularly in applications such as education, entertainment, and human-AI interaction. By providing a comprehensive and nuanced approach to evaluating speech role-play, the work encourages the creation of more sophisticated and human-aligned AI systems. The integration of human annotations and the focus on realistic speech delivery can lead to advancements in the quality and reliability of conversational agents.
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding the model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.
Primary: Stanford University
All Institutions: Stanford University, Queen Mary University of London
The main contribution of this paper is the introduction of Aria-Duet, a novel system that facilitates real-time musical collaboration between humans and AI, addressing critical interaction challenges and demonstrating the potential for sophisticated musical dialogue. This work significantly advances the field of AI in music by integrating state-of-the-art generative models with embodied performance, paving the way for future explorations in human-AI co-creation.
The paper presents a well-structured methodology that integrates a generative model with a real-time interactive system for musical co-creation. The use of a Yamaha Disklavier as a physical interface is innovative, allowing for a more embodied interaction between the human performer and the AI. The authors address critical engineering challenges such as latency and coherence in musical transitions, demonstrating a deep understanding of both the technical and artistic aspects of music performance. The continuous prefill strategy and the custom playback adjustments are particularly noteworthy as they enhance the user experience significantly.
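To make the turn-taking protocol concrete, here is a minimal sketch of a MIDI handover loop, assuming the Disklavier appears as ordinary MIDI input/output ports and that a stretch of silence signals the handover; the `generate_continuation` function, the silence threshold, and the playback pacing are illustrative assumptions and do not reflect Aria-Duet's actual continuous-prefill or latency machinery.

```python
# Minimal turn-taking sketch (not the Aria-Duet implementation): collect the
# user's MIDI notes, treat prolonged silence as the handover signal, then play
# a model-generated continuation back through the same instrument.
import time
import mido

SILENCE_HANDOVER_S = 2.0  # hypothetical handover threshold in seconds

def generate_continuation(context_msgs):
    """Placeholder for the generative model; returns MIDI messages to perform."""
    return context_msgs  # echo the user's phrase as a stand-in

with mido.open_input() as inport, mido.open_output() as outport:
    context, last_note_time = [], time.monotonic()
    while True:
        for msg in inport.iter_pending():
            if msg.type in ("note_on", "note_off"):
                context.append(msg)
                last_note_time = time.monotonic()
        if context and time.monotonic() - last_note_time > SILENCE_HANDOVER_S:
            for msg in generate_continuation(context):
                outport.send(msg)       # acoustic playback on the player piano
                time.sleep(0.05)        # crude pacing; real timing uses message deltas
            context = []
        time.sleep(0.01)
```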
The paper includes a demonstration of the system's capabilities through a video showcasing various musical prompts. The analysis of the system's output from a musicological perspective is thorough, providing insights into how well the model maintains stylistic semantics and coherence in its continuations. However, the paper lacks quantitative metrics or formal evaluations of the system's performance, which would strengthen the claims made about its effectiveness.
The paper provides a GitHub repository link for the Aria-Duet system, which is a positive aspect for reproducibility. However, detailed implementation specifics regarding the model training and the real-time engine setup are somewhat limited, which may pose challenges for other researchers attempting to replicate the work.
The paper acknowledges the need for further research to assess the broader impact of the Aria-Duet system on creativity and co-creation. Additionally, while the system shows promise, the authors note that it may not always generate highly inventive outputs, particularly with novel compositions. The reliance on a specific hardware setup (Disklavier) could also limit accessibility for some users.
The potential applications of this work are significant, as it opens up new avenues for human-AI collaboration in music composition. By addressing the interaction challenges faced by musicians when using AI tools, this research could help foster greater engagement with AI in creative processes. The implications extend beyond music, suggesting that similar approaches could be applied in other artistic domains where real-time interaction is crucial.
Remote monitoring of cardiovascular diseases plays an essential role in early detection of abnormal cardiac function, enabling timely intervention, improved preventive care, and personalized patient treatment. Abnormalities in heart sounds can be detected automatically via computer-assisted decision support systems and used as a first-line screening tool for cardiovascular problems, or for monitoring the effects of treatments and interventions. In this paper we propose CardioPHON, an integrated heart sound quality assessment and classification tool for screening abnormal cardiac function from phonocardiogram recordings. The model is pretrained in a self-supervised fashion on a collection of six small- and mid-sized heart sound datasets, enables automatic removal of low-quality recordings to ensure that subtle sounds of heart abnormalities are not misdiagnosed, and provides state-of-the-art performance on the heart sound classification task. The multimodal model that combines audio and socio-demographic features demonstrated superior performance, achieving the best ranking on the official leaderboard of the 2022 George B. Moody PhysioNet heart sound challenge, whereas the unimodal model, based only on phonocardiogram recordings, ranks first among unimodal approaches (fourth overall), surpassing models that utilize multiple modalities. CardioPHON is the first publicly released pretrained model in the domain of heart sound recordings, facilitating the development of data-efficient artificial intelligence models that can generalize to various downstream tasks in cardiovascular diagnostics.
Primary: Luxembourg Institute of Health
All Institutions: Luxembourg Institute of Health, University of Zilina, University of Maribor
The main contribution of this paper is the development of CardioPHON, a comprehensive tool for heart sound quality assessment and classification, leveraging self-supervised learning to enhance diagnostic capabilities in cardiovascular health. The integration of multimodal data and the achievement of state-of-the-art results in a competitive challenge underscore the significance of this work in advancing machine learning applications in healthcare.
The methodology presented in CardioPHON is robust, integrating a novel quality assessment model with a self-supervised learning approach for heart sound classification. The use of multiple datasets for pretraining is a significant strength, allowing the model to learn diverse features relevant to cardiac function. The combination of audio and socio-demographic features in a multimodal framework enhances the model's predictive capabilities. The choice of BYOL-A for self-supervised pretraining is well-justified, as it addresses the challenges of limited labeled data in medical domains. However, the paper could benefit from a clearer explanation of the feature extraction process and the rationale behind the specific choices made in the model architecture.
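For readers unfamiliar with BYOL-style pretraining, the objective can be sketched as follows; this is the generic BYOL loss between two augmented views, with placeholder encoder and augmentation modules, rather than the exact BYOL-A configuration used for CardioPHON.

```python
# Generic BYOL-style self-supervised loss: the online branch predicts the
# target branch's (stop-gradient) projection of another augmented view of the
# same heart sound clip, and the loss is a normalized MSE (2 - 2*cosine).
import torch
import torch.nn.functional as F

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)   # stop-gradient on the target branch
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

# Hypothetical usage with placeholder modules (symmetrized over both views):
# view_a, view_b = augment(clip), augment(clip)
# loss = byol_loss(predictor(online_encoder(view_a)), target_encoder(view_b)) \
#      + byol_loss(predictor(online_encoder(view_b)), target_encoder(view_a))
```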
The experimental evaluation is thorough, utilizing a variety of datasets and providing detailed performance metrics. The results demonstrate state-of-the-art performance in the PhysioNet 2022 challenge, which is a significant achievement. The paper effectively compares different training setups (feature extraction, training from scratch, and fine-tuning), showcasing the advantages of each approach. However, the reliance on a single challenge dataset for evaluation may limit the generalizability of the findings.
The paper provides a clear description of the datasets used and the experimental setup, which aids in reproducibility. The availability of the pretrained CardioPHON model and quality labels is a positive aspect that encourages further research and application. However, more details on the implementation specifics, such as hyperparameter tuning and data preprocessing steps, would enhance reproducibility.
One notable limitation is the model's development for a specific sampling frequency (1 kHz), which may not generalize well to other frequencies used in clinical settings (e.g., 4 kHz). This could lead to performance drops when applied to real-world data. Additionally, while the quality assessment improves classification performance, the paper acknowledges that the model could be further enhanced with a quality assessment model tailored for higher sampling frequencies.
The implications of this research are significant, as it addresses a critical need for automated tools in cardiovascular diagnostics. The integration of quality assessment with classification could lead to more reliable screening methods, potentially improving patient outcomes through earlier detection of cardiac abnormalities. The public release of the CardioPHON model may facilitate further advancements in the field and encourage the development of data-efficient AI models for other medical applications.
Neural audio codecs have recently enabled high-fidelity reconstruction at high compression rates, especially for speech. However, speech and non-speech audio exhibit fundamentally different spectral characteristics: speech energy concentrates in narrow bands around pitch harmonics (80-400 Hz), while non-speech audio requires faithful reproduction across the full spectrum, particularly preserving higher frequencies that define timbre and texture. This poses a challenge: speech-optimized neural codecs suffer degradation on music or sound. Treating the full spectrum holistically is suboptimal: frequency bands have vastly different information density and perceptual importance by content type, yet full-band approaches apply uniform capacity across frequencies without accounting for these acoustic structures. To address this gap, we propose BSCodec (Band-Split Codec), a novel neural audio codec architecture that splits the spectral dimension into separate bands and compresses each band independently. Experimental results demonstrate that BSCodec achieves superior reconstruction over baselines across sound and music, while maintaining competitive quality in the speech domain, when trained on the same combined dataset of speech, music and sound. Downstream benchmark tasks further confirm that BSCodec shows strong potential for use in downstream applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of BSCodec, a novel neural audio codec that effectively addresses the challenges of audio reconstruction by splitting the spectral dimension into separate bands for independent compression. This innovative approach has the potential to advance the field of audio processing significantly, although further details on methodology and reproducibility are needed to fully assess its impact.
The proposed BSCodec architecture innovatively splits the audio spectrum into distinct bands, allowing for independent compression tailored to the unique characteristics of speech and non-speech audio. This approach acknowledges the differing perceptual importance and information density across frequency bands, which is a significant advancement over traditional holistic methods. The methodology is sound, leveraging neural networks effectively, but lacks detailed descriptions of the architecture and training process that would allow for deeper scrutiny.
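A minimal sketch of the band-split idea as the abstract describes it, splitting an STFT along frequency and giving each band its own encoder, is shown below; the band boundaries, STFT settings, and per-band linear encoders are illustrative assumptions rather than the paper's architecture.

```python
# Illustrative band-split front end: compute an STFT, cut the frequency axis
# into contiguous bands, and encode each band with its own module so capacity
# can differ across frequency regions. All sizes here are hypothetical.
import torch
import torch.nn as nn

class BandSplitEncoder(nn.Module):
    def __init__(self, n_fft=1024, band_edges=(0, 64, 192, 513), dim=128):
        super().__init__()
        self.n_fft, self.band_edges = n_fft, band_edges
        widths = [band_edges[i + 1] - band_edges[i] for i in range(len(band_edges) - 1)]
        # one linear encoder per band, sized to that band's bin count (real + imag parts)
        self.band_encoders = nn.ModuleList(nn.Linear(2 * w, dim) for w in widths)

    def forward(self, wav):                          # wav: (batch, samples)
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 4,
                          window=torch.hann_window(self.n_fft, device=wav.device),
                          return_complex=True)       # (batch, freq_bins, frames)
        outs = []
        for (lo, hi), enc in zip(zip(self.band_edges[:-1], self.band_edges[1:]),
                                 self.band_encoders):
            band = spec[:, lo:hi, :]                             # (batch, width, frames)
            x = torch.cat([band.real, band.imag], dim=1)         # (batch, 2*width, frames)
            outs.append(enc(x.transpose(1, 2)))                  # (batch, frames, dim)
        return torch.stack(outs, dim=1)              # (batch, n_bands, frames, dim)
```

In a full codec, each band's latent stream would then be quantized and decoded separately; those components, and the paper's actual band boundaries, are omitted here.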
The experimental results demonstrate that BSCodec outperforms existing codecs in various audio types, including speech, music, and sound. However, the paper would benefit from a more comprehensive comparison with state-of-the-art methods and a clearer presentation of quantitative metrics. The benchmarks used are relevant, but the lack of detailed datasets and evaluation criteria limits the robustness of the claims.
The paper does not provide sufficient implementation details or code repositories, which raises concerns about reproducibility. Without access to the model architecture, training data, and hyperparameters, it is challenging for other researchers to replicate the results or build upon this work.
One limitation is the potential overfitting of the model to the combined dataset, which may not generalize well to unseen audio types. Additionally, the paper does not address the computational efficiency of the proposed codec, which is critical for practical applications.
BSCodec has the potential to significantly enhance audio compression technologies, particularly in applications requiring high fidelity, such as streaming services, gaming, and virtual reality. Its ability to handle diverse audio types could lead to improved user experiences across various platforms.
We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained dimensions, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralinguistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.
Primary: Institute for Infocomm Research (I²R)
All Institutions: Institute for Infocomm Research (I²R)
The main contribution of this paper is the introduction of MERaLiON-SER, a robust speech emotion recognition model tailored for English and Southeast Asian languages, demonstrating superior performance through innovative methodologies and extensive evaluations. The technical contributions, particularly in the areas of model architecture and training strategies, represent a meaningful advancement in the field of speech emotion recognition, addressing critical challenges such as multilingual generalization and paralinguistic understanding.
The methodology presented in MERaLiON-SER is robust, utilizing a hybrid loss function that combines weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) to address both discrete and dimensional emotion recognition. The architecture builds upon the Whisper-Medium encoder, incorporating attention-based pooling and ECAPA-TDNN modules, which are well-suited for capturing temporal dynamics and speaker-invariant features. The use of Low-Rank Adaptation (LoRA) for efficient fine-tuning while keeping the backbone frozen is innovative and enhances parameter efficiency. The introduction of multiscale and hierarchical attention pooling techniques is a significant advancement for capturing emotional cues at various temporal resolutions.
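The hybrid objective described above can be sketched compactly: a class-weighted cross-entropy for the discrete labels plus (1 − CCC) terms for arousal, valence, and dominance. The mixing weight and tensor shapes below are assumptions, not the paper's exact configuration.

```python
# Hybrid SER objective sketch: weighted categorical cross-entropy for the
# discrete emotion classes plus (1 - CCC) for each continuous dimension
# (arousal, valence, dominance). The mixing weight `lam` is a hypothetical choice.
import torch
import torch.nn.functional as F

def ccc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient over a batch of 1-D predictions."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pv + tv + (pm - tm) ** 2 + 1e-8)

def hybrid_loss(logits, labels, dim_preds, dim_targets, class_weights, lam=1.0):
    ce = F.cross_entropy(logits, labels, weight=class_weights)
    ccc_term = sum(1 - ccc(dim_preds[:, i], dim_targets[:, i])
                   for i in range(dim_preds.shape[1])) / dim_preds.shape[1]
    return ce + lam * ccc_term
```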
The experimental evaluation is extensive, covering multiple datasets, including proprietary and public benchmarks. The results demonstrate that MERaLiON-SER consistently outperforms both open-source speech encoders and large Audio-LLMs across various languages and emotion categories. The use of Unweighted Average Recall (UAR) as a performance metric is appropriate given the class imbalance in emotion datasets. The paper provides detailed comparisons with existing models, showcasing the advantages of the proposed approach in multilingual and culturally diverse contexts.
The paper includes comprehensive details on the training configuration, datasets, and evaluation setup, which are crucial for reproducibility. However, the reliance on proprietary datasets may limit the ability of other researchers to fully replicate the results. The availability of the model on Hugging Face is a positive step towards facilitating reproducibility.
The paper acknowledges limitations such as potential label noise from pseudo-labeled data and the restriction to seven emotion classes, which may not encompass more complex emotional contexts. Future work is suggested to enhance the model's capabilities by incorporating more manually labeled data and expanding the range of emotions recognized.
The development of MERaLiON-SER has significant implications for applications in affective computing, particularly in creating empathetic conversational agents and enhancing human-machine interaction. The model's ability to generalize across languages and cultural contexts positions it as a valuable tool for various industries, including mental health monitoring, customer service, and robotics.
The proliferation of distorted, compressed, and manipulated music on modern media platforms like TikTok motivates the development of more robust audio fingerprinting techniques to identify the sources of musical recordings. In this paper, we develop and evaluate new neural audio fingerprinting techniques with the aim of improving their robustness. We make two contributions to neural fingerprinting methodology: (1) we use a pretrained music foundation model as the backbone of the neural architecture, and (2) we expand the use of data augmentation to train fingerprinting models under a wide variety of audio manipulations, including time stretching, pitch modulation, compression, and filtering. We systematically evaluate our methods in comparison to two state-of-the-art neural fingerprinting models: NAFP and GraFPrint. Results show that fingerprints extracted with music foundation models (e.g., MuQ, MERT) consistently outperform models trained from scratch or pretrained on non-musical audio. Segment-level evaluation further reveals their capability to accurately localize fingerprint matches, an important practical feature for catalog management.
Primary: Cornell University
All Institutions: Cornell University
The main contribution of this work is the introduction of pretrained music foundation models for robust audio fingerprinting, demonstrating superior performance over traditional methods and establishing a new benchmark for future research in the field. The comprehensive evaluation and innovative methodology position this paper as a significant advancement in audio processing and machine learning applications.
The methodology is robust, leveraging pretrained music foundation models (MuQ and MERT) for audio fingerprinting, which is a significant advancement over traditional methods. The systematic evaluation of data augmentations and the architecture design (including a two-layer MLP projection head) are well-justified and contribute to the overall effectiveness of the approach. The choice of datasets and the detailed description of augmentations provide a solid foundation for the experiments.
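To illustrate the second contribution, an expanded augmentation regime, here is a minimal sketch of a training-time distortion chain (random gain, additive noise at a random SNR, and a crude low-pass as a stand-in for filtering and compression artifacts); the paper's actual transforms, including time stretching and pitch modulation, and their parameter ranges are not reproduced here.

```python
# Illustrative waveform augmentation chain for fingerprinting robustness: each
# training example passes through a random subset of simple distortions.
# Parameter ranges are hypothetical; the paper's exact augmentations differ.
import random
import torch

def random_gain(wav, low=0.25, high=1.5):
    return wav * random.uniform(low, high)

def additive_noise(wav, snr_db_range=(5.0, 30.0)):
    snr_db = random.uniform(*snr_db_range)
    noise = torch.randn_like(wav)
    scale = wav.norm() / (noise.norm() * 10 ** (snr_db / 20) + 1e-8)
    return wav + scale * noise

def crude_lowpass(wav, kernel_size=15):
    kernel = torch.ones(1, 1, kernel_size) / kernel_size       # moving-average filter
    return torch.nn.functional.conv1d(wav.view(1, 1, -1), kernel,
                                      padding=kernel_size // 2).view(-1)

def augment(wav):
    for fn in (random_gain, additive_noise, crude_lowpass):
        if random.random() < 0.5:
            wav = fn(wav)
    return wav
```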
The experiments are comprehensive, comparing against established models (NAFP and GraFPrint) and including both track-level and segment-level evaluations. The results demonstrate a clear performance advantage of the proposed models, particularly in challenging conditions, which is crucial for real-world applications. However, the paper could benefit from more detailed statistical analyses to quantify the significance of the results.
The paper provides sufficient details about the datasets, model architectures, and training procedures, which enhances reproducibility. However, the lack of publicly accessible code or demo URLs limits the ability for others to replicate the findings easily.
The paper acknowledges weaknesses in handling certain transformations, such as spectral filtering, which could be a significant limitation in practical applications. Additionally, the focus on specific datasets may not fully represent the diversity of audio encountered in real-world scenarios.
The advancements in audio fingerprinting have significant implications for music identification, copyright enforcement, and catalog management in the age of digital media. The robustness of the proposed methods against various audio manipulations is particularly relevant for platforms like TikTok, where music is frequently altered.
Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher separation audio quality than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require operations beyond separation, such as sound removal; and (2) relying solely on text prompts can be unintuitive for specifying sound sources. In this paper, we propose PromptSep to extend LASS into a broader framework for general-purpose sound separation. PromptSep leverages a conditional diffusion model enhanced with elaborated data simulation to enable both audio extraction and sound removal. To move beyond text-only queries, we incorporate vocal imitation as an additional and more intuitive conditioning modality for our model, by incorporating Sketch2Sound as a data augmentation strategy. Both objective and subjective evaluations on multiple benchmarks demonstrate that PromptSep achieves state-of-the-art performance in sound removal and vocal-imitation-guided source separation, while maintaining competitive results on language-queried source separation.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
PromptSep offers a unified framework for sound extraction and removal that overcomes key limitations of existing LASS systems. The integration of vocal imitation as a query modality addresses the ambiguity and limitations of text prompts, offering a more intuitive interface for users. Through comprehensive evaluations, PromptSep demonstrates state-of-the-art performance in sound removal and vocal-imitation-guided separation, while remaining competitive in standard LASS settings.
The methodology presented in PromptSep is innovative, as it combines a conditional diffusion model with multimodal prompting that includes both text and vocal imitation. This dual conditioning approach addresses the limitations of traditional language-queried audio source separation (LASS) systems, allowing for more intuitive user interactions and enabling both sound extraction and removal. The incorporation of a data simulation pipeline enhances the model's training process, allowing it to generalize better across various sound types and conditions. The use of Sketch2Sound for augmenting vocal imitation data is a notable strength, as it provides a richer dataset for training.
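As background on how two query modalities can drive a single conditional diffusion model, the sketch below embeds a text prompt and a vocal imitation separately and randomly drops each during training so either can be used alone at inference. This is a common multi-conditioning pattern offered purely as an illustration; PromptSep's actual conditioning interface is not reproduced here.

```python
# Generic multi-modality conditioning sketch: project the text query and the
# vocal imitation separately, randomly replace each with a learned "null"
# embedding during training (so either can be used alone at inference), and
# concatenate the result as conditioning for a diffusion backbone.
import torch
import torch.nn as nn

class DualCondition(nn.Module):
    def __init__(self, text_dim=512, imitation_dim=128, cond_dim=256, p_drop=0.3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.imit_proj = nn.Linear(imitation_dim, cond_dim)
        self.null_text = nn.Parameter(torch.zeros(cond_dim))
        self.null_imit = nn.Parameter(torch.zeros(cond_dim))
        self.p_drop = p_drop

    def forward(self, text_emb, imit_emb):
        t = self.text_proj(text_emb)
        v = self.imit_proj(imit_emb)
        if self.training:                       # modality dropout
            if torch.rand(()) < self.p_drop:
                t = self.null_text.expand_as(t)
            if torch.rand(()) < self.p_drop:
                v = self.null_imit.expand_as(v)
        return torch.cat([t, v], dim=-1)        # conditioning vector for the diffusion model
```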
The experiments are well-structured, utilizing multiple benchmarks for both objective and subjective evaluations. The paper reports strong performance across various metrics, demonstrating the effectiveness of the proposed model in both sound extraction and removal tasks. The inclusion of both in-domain and out-of-domain datasets adds robustness to the evaluation, although the reliance on subjective evaluations could be further detailed in terms of participant demographics and scoring consistency.
The paper provides a reasonable level of detail regarding the model architecture and training procedures, but it lacks specific implementation details that would enhance reproducibility, such as hyperparameter settings and training duration. The release of the VimSketchGen dataset is a positive step towards facilitating further research, but additional resources, such as code or pre-trained models, would be beneficial for the community.
One limitation is the potential complexity of the model, which may require significant computational resources for training and inference. Additionally, while the model shows promise in handling vocal imitations, the paper does not explore the combined effects of using both text and vocal imitation during inference, which could limit its applicability in real-world scenarios. The subjective evaluation could also be influenced by the biases of the human annotators.
The advancements made in audio source separation through PromptSep have significant implications for various applications, including music production, film editing, and accessibility technologies for the hearing impaired. By allowing users to specify sounds more intuitively, the model could democratize audio editing tools, making them more accessible to non-experts.
Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, the first zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MusRec, a novel zero-shot text-to-music editing framework that utilizes rectified flow and diffusion transformers, significantly advancing the capabilities of music editing in real-world scenarios. The technical contributions and innovative methodology present a meaningful step forward in the intersection of machine learning and audio processing, with the potential for substantial impact in various creative industries.
The methodology presented in MusRec is innovative, leveraging rectified flow and diffusion transformers to enable zero-shot editing of real-world music. The authors adapt existing techniques from image processing to audio, specifically focusing on the inversion and editing processes. The framework's design allows for flexible, user-friendly interaction without the need for extensive retraining or precise prompts, which is a significant advancement in the field of music editing. However, the paper could benefit from clearer explanations of the underlying mathematical formulations and their implications for audio processing.
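For context on the rectified-flow machinery MusRec builds on, the standard formulation is summarized below; this is the generic objective and Euler step for sampling and inversion, not the paper's specific latent representation or conditioning.

```latex
% Rectified flow in its standard form: interpolate between noise x_0 and data x_1,
% regress the constant velocity, and edit by integrating the learned ODE
% (backward from the recording for inversion, forward from noise for regeneration).
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1],\\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}\bigl\lVert v_\theta(x_t, t) - (x_1 - x_0) \bigr\rVert^2,\\
x_{t \pm \Delta} &\approx x_t \pm \Delta\, v_\theta(x_t, t) \quad \text{(Euler step for sampling or inversion).}
\end{aligned}
```

MusRec's contribution lies in adapting this inversion-based editing to real-world music audio; the equations above only fix the shared notation.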
The experiments conducted are robust, utilizing both objective metrics (such as CLAP similarity and Fréchet Audio Distance) and subjective evaluations (Mean Opinion Scores) to assess the performance of the proposed model against strong baselines. The results indicate that MusRec outperforms existing methods in various aspects of music editing, showcasing its effectiveness in maintaining fidelity and semantic alignment. However, the datasets used are relatively small, which may limit the generalizability of the findings.
The paper lacks sufficient details regarding the implementation and availability of the code, which raises concerns about reproducibility. While the authors describe their methodology and experiments in detail, the absence of a publicly accessible repository or demo limits the ability for other researchers to replicate the results.
One of the main limitations is the reliance on small datasets, which may not capture the full diversity of real-world music. Additionally, while the zero-shot capability is a significant advantage, the performance may vary depending on the complexity of the editing tasks and the characteristics of the input audio. The paper also does not address potential computational costs associated with the model's operations, which could be a barrier for practical applications.
The potential applications of MusRec are extensive, ranging from enhancing music production workflows to enabling personalized music experiences for users. By simplifying the editing process and making it more accessible, this work could democratize music creation and editing, allowing a broader audience to engage with music technology. The implications for the entertainment industry, particularly in film and gaming, are particularly noteworthy, as this technology could streamline audio production processes.
In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-and-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
Primary: POLITEHNICA Bucharest
All Institutions: POLITEHNICA Bucharest
The main contribution of this paper is the introduction of a state-of-the-art Romanian ASR system using the FastConformer architecture, achieving significant performance improvements across various benchmarks. This work not only advances the field of speech recognition for Romanian but also sets a precedent for future research in low-resource languages, showcasing the potential of modern architectures and hybrid decoding strategies.
The paper presents a robust methodology by adapting the FastConformer architecture for Romanian ASR, leveraging a hybrid CTC-TDT decoder that allows for flexible decoding strategies. The use of both weakly supervised and high-quality transcriptions for training is innovative, particularly in the context of low-resource languages. The detailed exploration of various decoding strategies and their trade-offs adds depth to the methodology, showcasing a comprehensive understanding of ASR challenges.
The experimental evaluation is thorough, utilizing a diverse set of Romanian speech datasets, which strengthens the validity of the results. The paper reports significant improvements in WER across multiple benchmarks, demonstrating the effectiveness of the proposed system compared to existing models. The clear presentation of results, including comparisons with baseline systems, enhances the credibility of the findings.
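As a reminder of how the headline number is computed, relative WER reduction divides the absolute improvement by the baseline WER; the figures below are hypothetical and merely illustrate a 27% relative gain.

```latex
% Relative WER reduction with hypothetical numbers: a drop from 10.0% to 7.3%
% WER corresponds to a 27% relative improvement.
\Delta_{\mathrm{rel}} \;=\; \frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{baseline}}}
\;\;\Longrightarrow\;\;
\frac{10.0\% - 7.3\%}{10.0\%} \;=\; 27\%.
```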
The authors commit to releasing their trained model and detailed training recipes, which is commendable for reproducibility. However, the paper could benefit from more explicit details on hyperparameter tuning and the specific configurations used during training and evaluation.
While the paper achieves state-of-the-art results, it does not address potential limitations related to the reliance on weakly supervised data, which may affect the robustness of the model in real-world applications. Additionally, the computational demands of some decoding strategies may limit their applicability in low-latency scenarios.
The work has significant implications for the development of ASR systems in low-resource languages, potentially facilitating advancements in Romanian speech technology and contributing to more inclusive language processing tools. The open-source nature of the project encourages further research and development in this area.
Emotions are fundamental to the creation and perception of music performances. However, achieving human-like expression and emotion through machine learning models for performance rendering remains a challenging task. In this work, we present SyMuPe, a novel framework for developing and training affective and controllable symbolic piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. By design, it supports both unconditional generation and infilling of music performance features. For training, we use a curated, cleaned dataset of 2,968 hours of aligned musical scores and expressive MIDI performances. For text and emotion control, we integrate a piano performance emotion classifier and tune PianoFlow with the emotion-weighted Flan-T5 text embeddings provided as conditional inputs. Objective and subjective evaluations against transformer-based baselines and existing models show that PianoFlow not only outperforms other approaches, but also achieves performance quality comparable to that of human-recorded and transcribed MIDI samples. For emotion control, we present and analyze samples generated under different text conditioning scenarios. The developed model can be integrated into interactive applications, contributing to the creation of more accessible and engaging music performance systems.
Primary: Skolkovo Institute of Science and Technology
All Institutions: Skolkovo Institute of Science and Technology, Peachnote GmbH
This paper presents a significant advancement in the field of machine learning for music performance, introducing a novel framework and model that effectively captures the nuances of expressive piano playing while allowing for intuitive control through emotional and textual inputs. The comprehensive evaluation and innovative methodology position this work as a valuable contribution to both the academic and practical realms of music technology.
The methodology presented in this paper is robust and innovative, particularly the introduction of the PianoFlow model which utilizes conditional flow matching for expressive music performance rendering. The integration of emotion classification and text embeddings for control over performance adds a significant layer of sophistication. The authors also provide a comprehensive framework (SyMuPe) for tokenizing and modeling symbolic music, which is a notable advancement in the field. The use of a curated dataset of 2,968 hours of aligned musical scores and expressive MIDI performances is commendable, as it addresses the limitations of existing datasets.
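A minimal sketch of a conditional flow-matching training step with an inpainting-style mask helps make the "multi-mask performance inpainting" formulation concrete; the tiny model, feature layout, and masking scheme below are placeholders rather than PianoFlow's actual design.

```python
# Conditional flow matching with inpainting-style masking (illustrative):
# interpolate noise toward the target performance features, hide a random
# subset of features, and regress the velocity where features must be infilled.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2 * 64 + 1, 256), nn.GELU(), nn.Linear(256, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def cfm_inpainting_step(x1):                     # x1: (batch, 64) performance features
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.shape[0], 1)               # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1
    mask = (torch.rand_like(x1) < 0.5).float()   # 1 = region to infill (hypothetical scheme)
    context = x1 * (1 - mask)                    # known features given as conditioning
    v_pred = model(torch.cat([xt, context, t], dim=-1))
    loss = ((v_pred - (x1 - x0)) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

loss = cfm_inpainting_step(torch.randn(8, 64))   # one step on a dummy batch
```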
The experimental evaluation is thorough, employing both objective metrics and subjective listening tests to validate the performance of the PianoFlow model against transformer-based baselines and external models. The results indicate that PianoFlow achieves superior performance in terms of expressiveness and quality, with a high win rate in listener preference tests. The use of diverse evaluation methods strengthens the findings and demonstrates the model's capabilities effectively.
The paper provides detailed descriptions of the model architecture, training configurations, and evaluation methodologies, which are essential for reproducibility. However, the absence of a public code repository limits the ability for other researchers to replicate the experiments fully. The authors mention the use of a supercomputer for their experiments, which may not be accessible to all researchers, potentially hindering reproducibility.
The authors acknowledge several limitations, including the approximation of pedal effects and the handling of trills, which could impact the realism of the generated performances. Additionally, the emotion classifier's limitations and the potential for arbitrary text inputs to produce undesirable results are noted. The lack of ablation studies to assess the impact of different components on performance is also a shortcoming.
The potential applications of this research are significant, as it contributes to the development of more intelligent and controllable music performance systems. The ability to generate expressive music performances that can be controlled through text and emotion inputs may enhance interactive applications in music education, entertainment, and therapy. This work could pave the way for more accessible music composition tools, fostering creativity among users with varying musical backgrounds.
Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalization to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that can leverage only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can further function as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Ultimately, TASU can extend its zero-shot generalization to a wide range of speech understanding tasks and notably outperforms prominent Speech LLMs including GLM-4-Voice and Step-Audio on the MMSU benchmark, establishing TASU as an efficient and scalable alignment paradigm for Speech LLMs.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, AISpeech Ltd, Huazhong University of Science and Technology, Jiangsu Key Lab of Language Computing, MoE Key Lab of Artificial Intelligence, NLP Lab, School of Computer Science, School of Electronic Information and Communications, Soochow University
The main contribution of this paper is the introduction of TASU, a novel alignment paradigm that enables effective speech understanding using only unpaired text data, significantly advancing the field of Speech LLMs. The technical contributions, particularly in methodology and experimental validation, position TASU as a promising approach for enhancing speech recognition capabilities while addressing existing limitations in data dependency.
The proposed TASU method introduces innovative techniques such as a phone-synchronized disposal mechanism and a CTC topology simulation, which are particularly noteworthy for their ability to facilitate cross-modal alignment using only unpaired text data. This approach is a significant departure from traditional methods that rely on audio-text pairs, thus addressing a critical gap in the field. The methodology is well-structured, although further details on the implementation and optimization of these techniques would enhance clarity.
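The exact procedure behind the CTC topology simulation is not spelled out in this review, but the idea can be made concrete with a hedged sketch: expand a text-derived phone sequence into a pseudo-frame sequence with repeats and blanks so that text-only data resembles what a CTC speech encoder would emit. The function name, repeat range, and blank probability below are illustrative assumptions, not the authors' exact mechanism.

```python
import random

BLANK = "<blank>"

def simulate_ctc_sequence(phones, max_repeat=3, blank_prob=0.3):
    """Expand a phone sequence into CTC-like pseudo-frames: each symbol is
    repeated a random number of times and blanks are optionally inserted,
    mimicking the topology of CTC encoder outputs (assumed details)."""
    frames = []
    for p in phones:
        frames.extend([p] * random.randint(1, max_repeat))  # repeated emissions
        if random.random() < blank_prob:
            frames.append(BLANK)                             # optional blank
    return frames

# Usage: the pseudo-frames would be embedded and fed to the LLM in place of
# real acoustic features during text-only alignment.
print(simulate_ctc_sequence(["HH", "AH", "L", "OW"]))
```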
The experiments demonstrate the effectiveness of TASU in achieving competitive zero-shot speech recognition and improving domain generalization. The comparison with existing Speech LLMs on the MMSU benchmark provides a solid basis for evaluating performance. However, evaluation on larger datasets and a more diverse set of tasks would strengthen the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the methodology is promising, the absence of code or detailed experimental setups limits the ability of other researchers to validate the findings independently.
One significant limitation is the reliance on text data alone, which may not capture the full complexity of speech signals. Additionally, the performance in truly unseen domains or tasks remains to be thoroughly evaluated, which could impact the generalizability of the proposed method.
The implications of TASU are substantial, particularly in making speech understanding technologies more accessible by reducing the need for large audio datasets. This could democratize advancements in speech recognition and understanding across various applications, from virtual assistants to accessibility tools.
This paper proposes VoxStudio, the first unified and end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, which compresses raw speech into compact semantic tokens, preserving prosody and emotional nuance. By operating directly on these tokens, VoxStudio eliminates the need for an additional speech-to-text system, which often ignores the hidden details beyond text, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built via an advanced TTS engine to affordably generate richly expressive utterances. Comprehensive experiments on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility of our method and highlight key challenges, including emotional consistency and linguistic ambiguity, paving the way for future research.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of VoxStudio, a pioneering speech-to-image model that effectively captures emotional and linguistic nuances in generated images. This work represents a substantial step forward in multimodal machine learning, with the potential to influence various applications in both creative and assistive technologies.
The proposed methodology, VoxStudio, introduces a novel speech information bottleneck (SIB) module that effectively compresses raw speech into semantic tokens while preserving emotional nuances. This approach is innovative as it bypasses traditional speech-to-text systems, which often lose critical paralinguistic information. The integration of linguistic and paralinguistic features in a unified model is a significant advancement in the field of speech-to-image generation. However, the paper could benefit from a more detailed explanation of the SIB module's architecture and the training process involved.
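Since the SIB module's architecture is not detailed in this review, the following toy sketch only illustrates the general shape of such a bottleneck: downsample frame-level speech features and quantize them against a small codebook of discrete "semantic tokens." Dimensions, codebook size, and the straight-through estimator are illustrative choices, not the paper's design.

```python
import torch
import torch.nn as nn

class SpeechInfoBottleneck(nn.Module):
    """Toy stand-in for an SIB-style module: strided convolution compresses
    the frame rate, then nearest-codebook lookup yields discrete tokens."""
    def __init__(self, in_dim=80, token_dim=256, codebook_size=1024, stride=4):
        super().__init__()
        self.down = nn.Conv1d(in_dim, token_dim, kernel_size=stride, stride=stride)
        self.codebook = nn.Embedding(codebook_size, token_dim)

    def forward(self, feats):                                  # feats: (B, T, in_dim)
        z = self.down(feats.transpose(1, 2)).transpose(1, 2)   # (B, T // stride, token_dim)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        ids = torch.cdist(z, codes).argmin(dim=-1)             # discrete token ids
        q = self.codebook(ids)
        q = z + (q - z).detach()                               # straight-through gradient
        return ids, q                                          # tokens for the image generator

tokens, quantized = SpeechInfoBottleneck()(torch.randn(2, 400, 80))
```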
The authors conduct comprehensive experiments on multiple benchmarks, including SpokenCOCO, Flickr8kAudio, and their own VoxEmoset dataset. The results demonstrate the feasibility of the approach, although more quantitative metrics would be needed to substantiate the claims about emotional consistency and the difficulties posed by linguistic ambiguity. The experiments are well structured, but additional comparisons with state-of-the-art models would strengthen the evaluation.
The paper lacks detailed implementation specifics, such as hyperparameters, training duration, and code availability, which are crucial for reproducibility. The absence of a public repository or demo further limits the ability of other researchers to replicate the findings and build upon the work.
Key limitations include the potential for emotional inconsistency in generated images and the challenges posed by linguistic ambiguity. The paper acknowledges these issues but does not provide a robust framework for addressing them in future work. Additionally, the reliance on a large-scale dataset generated via TTS may introduce biases that could affect the model's generalizability.
The implications of this research are significant, as it opens new avenues for applications in creative industries, such as animation and video game design, where expressive visual content is essential. The ability to generate images from speech could also enhance accessibility tools for individuals with disabilities, allowing for richer interactions in virtual environments. However, ethical considerations regarding the potential misuse of such technology must be addressed.
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of Speech-DRAME, a comprehensive framework for evaluating speech role-play that combines dual evaluation strategies and human-annotated datasets, significantly improving the assessment of generative models in speech interactions. This work addresses critical gaps in existing evaluation methodologies, paving the way for more effective and realistic speech-based AI systems.
The methodology presented in the paper is robust and innovative, introducing a dual evaluation paradigm that combines archetype and realism assessments for speech role-play. The creation of DRAME-EvalBench and DRAME-Eval demonstrates a thoughtful approach to addressing the limitations of existing evaluation models by incorporating human-annotated data and fine-tuning techniques. The authors effectively leverage a comprehensive dataset that includes both synthetic and real human speech, which enhances the evaluation's relevance and applicability. The clear definitions and structured evaluation strategies are commendable, making the framework both systematic and reproducible.
The experimental evaluation is thorough, with a well-defined setup that includes zero-shot, few-shot, and fine-tuning conditions. The results demonstrate significant improvements in correlation with human ratings when using the DRAME-Eval model compared to existing ALLMs. The paper provides detailed statistical analyses and comparisons across various models, showcasing the effectiveness of the proposed methods. However, the reliance on proprietary models for some evaluations may limit the generalizability of the findings.
The authors emphasize reproducibility by providing access to datasets, models, and code through a GitHub repository. The detailed descriptions of the experimental setup, including data collection, annotation protocols, and evaluation metrics, enhance the likelihood that other researchers can replicate the study. The commitment to transparency is evident, although the use of proprietary models may pose challenges for full reproducibility in certain aspects.
One notable limitation is the potential bias introduced by the reliance on proprietary models for evaluation, which may not be accessible to all researchers. Additionally, while the dual evaluation framework is innovative, the complexity of the evaluation process may pose challenges for broader adoption in the community. The paper also acknowledges the gap between synthetic and real human performance, indicating that further work is needed to bridge this divide.
The proposed framework has significant implications for the development and evaluation of speech-based generative models, particularly in applications such as education, entertainment, and human-AI interaction. By providing a comprehensive and nuanced approach to evaluating speech role-play, the work encourages the creation of more sophisticated and human-aligned AI systems. The integration of human annotations and the focus on realistic speech delivery can lead to advancements in the quality and reliability of conversational agents.
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding the model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.
Primary: Stanford University
All Institutions: Stanford University, Queen Mary University of London
The main contribution of this paper is the introduction of Aria-Duet, a novel system that facilitates real-time musical collaboration between humans and AI, addressing critical interaction challenges and demonstrating the potential for sophisticated musical dialogue. This work significantly advances the field of AI in music by integrating state-of-the-art generative models with embodied performance, paving the way for future explorations in human-AI co-creation.
The paper presents a well-structured methodology that integrates a generative model with a real-time interactive system for musical co-creation. The use of a Yamaha Disklavier as a physical interface is innovative, allowing for a more embodied interaction between the human performer and the AI. The authors address critical engineering challenges such as latency and coherence in musical transitions, demonstrating a deep understanding of both the technical and artistic aspects of music performance. The continuous prefill strategy and the custom playback adjustments are particularly noteworthy, as they directly address those latency and coherence challenges.
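The paper's handover mechanism and model interface are not reproduced here; the sketch below assumes a silence-based handover and a generate_continuation callback standing in for Aria, purely to make the turn-taking structure concrete using the mido MIDI library. The silence threshold and the callback's output format are hypothetical.

```python
import time
import mido  # assumes MIDI ports to the Disklavier are available

HANDOVER_SILENCE_S = 2.0   # assumed handover rule: N seconds without a note

def duet_loop(generate_continuation, inport_name=None, outport_name=None):
    """Illustrative turn-taking loop (not the authors' implementation):
    buffer the human's MIDI until a silence-based handover, then ask the
    model for a continuation and play it back on the acoustic piano."""
    with mido.open_input(inport_name) as inp, mido.open_output(outport_name) as out:
        context, last_note_time = [], time.monotonic()
        while True:
            for msg in inp.iter_pending():
                if msg.type in ("note_on", "note_off"):
                    context.append(msg)                 # accumulate the human's turn
                    last_note_time = time.monotonic()
            if context and time.monotonic() - last_note_time > HANDOVER_SILENCE_S:
                # generate_continuation returns (pitch, velocity, duration) tuples
                for note, velocity, dur in generate_continuation(context):
                    out.send(mido.Message("note_on", note=note, velocity=velocity))
                    time.sleep(dur)
                    out.send(mido.Message("note_off", note=note))
                context, last_note_time = [], time.monotonic()
            time.sleep(0.01)
```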
The paper demonstrates the system's capabilities through a video showcasing various musical prompts. The analysis of the system's output from a musicological perspective is thorough, providing insights into how well the model maintains stylistic semantics and coherence in its continuations. However, the paper lacks quantitative metrics and formal evaluations of the system's performance; including them would strengthen the claims about its effectiveness.
The paper provides a GitHub repository link for the Aria-Duet system, which is a positive aspect for reproducibility. However, detailed implementation specifics regarding the model training and the real-time engine setup are somewhat limited, which may pose challenges for other researchers attempting to replicate the work.
The paper acknowledges the need for further research to assess the broader impact of the Aria-Duet system on creativity and co-creation. Additionally, while the system shows promise, the authors note that it may not always generate highly inventive outputs, particularly with novel compositions. The reliance on a specific hardware setup (Disklavier) could also limit accessibility for some users.
The potential applications of this work are significant, as it opens up new avenues for human-AI collaboration in music composition. By addressing the interaction challenges faced by musicians when using AI tools, this research could help foster greater engagement with AI in creative processes. The implications extend beyond music, suggesting that similar approaches could be applied in other artistic domains where real-time interaction is crucial.
Prosody is essential for speech technology, shaping comprehension, naturalness, and expressiveness. However, current text-to-speech (TTS) systems still struggle to accurately capture human-like prosodic variation, in part because existing evaluation methods for prosody remain limited. Traditional metrics like Mean Opinion Score (MOS) are resource-intensive, inconsistent, and offer little insight into why a system sounds unnatural. This study introduces a linguistically informed, semi-automatic framework for evaluating TTS prosody through a two-tier architecture that mirrors human prosodic organization. The method uses quantitative linguistic criteria to evaluate synthesized speech against human speech corpora across multiple acoustic dimensions. By integrating discrete and continuous prosodic measures, it provides objective and interpretable metrics of both event placement and cue realization, while accounting for the natural variability observed across speakers and prosodic cues. Results show strong correlations with perceptual MOS ratings while revealing model-specific weaknesses that traditional perceptual tests alone cannot capture. This approach provides a principled path toward diagnosing, benchmarking, and ultimately improving the prosodic naturalness of next-generation TTS systems.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania
This paper presents a novel framework for evaluating prosody in TTS systems, offering a linguistically informed approach that bridges the gap between subjective and objective evaluation methods. The comprehensive analysis of the technical contributions and methodology highlights its significance in advancing the field of speech synthesis.
The proposed methodology introduces a two-tier framework for evaluating TTS prosody that integrates both discrete and continuous measures, reflecting the complexity of human prosodic expression. This approach is grounded in linguistic theory, allowing for a more nuanced evaluation of synthesized speech compared to traditional metrics like MOS. The framework's ability to account for variability among speakers and prosodic cues is a significant advancement, as it provides interpretable metrics that can diagnose specific weaknesses in TTS models.
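The paper's specific criteria and acoustic dimensions are not enumerated in this review, but the two-tier idea can be made concrete with a hedged sketch: a discrete tier that scores the placement of prosodic events against a reference within a tolerance window, and a continuous tier that compares cue realization against a human corpus. The choice of per-utterance F0 range as the cue, the tolerance, and the normalization are illustrative assumptions only.

```python
import numpy as np

def event_placement_f1(pred_events, ref_events, tol=0.05):
    """Tier 1 (discrete): F1 between predicted and reference prosodic-event
    times (seconds), greedily matched within a +/- tol window."""
    remaining = list(ref_events)
    tp = 0
    for t in pred_events:
        hit = next((r for r in remaining if abs(r - t) <= tol), None)
        if hit is not None:
            tp += 1
            remaining.remove(hit)
    prec = tp / max(len(pred_events), 1)
    rec = tp / max(len(ref_events), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)

def cue_realization_distance(tts_f0_ranges, human_f0_ranges):
    """Tier 2 (continuous): distance of a synthesized cue distribution from
    the human corpus, expressed in units of human standard deviation."""
    mu, sd = np.mean(human_f0_ranges), np.std(human_f0_ranges) + 1e-9
    return abs(np.mean(tts_f0_ranges) - mu) / sd
```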
The experiments conducted are robust, utilizing a range of TTS models and a well-defined human speech corpus. The comparison of model outputs against human performances using both perceptual evaluations (MOS and pairwise comparisons) and objective acoustic measures strengthens the findings. The results demonstrate clear correlations between the proposed metrics and human judgments, validating the effectiveness of the new evaluation framework.
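For completeness, validating such metrics against MOS typically comes down to correlating per-system (or per-utterance) framework scores with the listening-test ratings; the values below are placeholders, not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical per-system values -- not the paper's results.
framework_scores = [0.62, 0.71, 0.55, 0.80]   # outputs of the evaluation framework
mos_ratings = [3.4, 3.9, 3.1, 4.2]            # MOS from the listening test
r, p = pearsonr(framework_scores, mos_ratings)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```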
The paper outlines a clear methodology for data collection and analysis, including the use of specific acoustic features and statistical measures. However, the lack of a publicly accessible dataset or code repository limits the reproducibility of the findings. Future work should aim to provide these resources to facilitate further research and validation of the proposed methods.
One limitation of the study is the reliance on a single corpus of human speech, which may not capture the full range of prosodic variability present in natural speech across different contexts and speakers. Additionally, while the framework addresses some limitations of traditional metrics, it may still be sensitive to the specific characteristics of the chosen reference corpus, potentially affecting generalizability.
The proposed evaluation framework has the potential to significantly impact the development of next-generation TTS systems by providing clearer diagnostic tools for improving prosodic naturalness. By moving beyond traditional evaluation methods, this work could lead to more expressive and human-like speech synthesis, enhancing applications in various fields such as virtual assistants, audiobooks, and accessibility technologies.
In the era of large language models (LLMs) and artificial general intelligence (AGI), computer audition must evolve beyond traditional paradigms to fully leverage the capabilities of foundation models, towards more comprehensive understanding, more natural generation and more human-like interaction. Audio, as a modality rich in semantic, emotional, and contextual cues, plays a vital role in achieving naturalistic and embodied machine intelligence. This survey provides a comprehensive review of recent progress in integrating audio into LLMs, with a focus on four key areas: audio comprehension, audio generation, speech-based interaction, and audio-visual understanding. We analyze how LLMs are reshaping audio perception and reasoning, enabling systems to understand sound at a deeper semantic level, generate expressive audio outputs, and engage in human-like spoken interaction. Furthermore, we explore how the fusion of audio and visual modalities enhances situational awareness and cross-modal reasoning, pushing the boundaries of multimodal intelligence. This survey not only synthesizes existing research but also identifies critical challenges and future directions for building audio-native AGI systems capable of perceiving, understanding, and interacting through sound as naturally as humans do.
Primary: Tsinghua University
All Institutions: Tsinghua University, Google DeepMind, NVIDIA, University of Cambridge
The main contribution of this paper is its comprehensive review of the integration of audio into large multimodal models, synthesizing existing research and identifying future challenges in the pursuit of general auditory intelligence. The survey provides valuable insights into the evolving landscape of machine listening and speaking, although it lacks original experimental contributions that would enhance its technical impact.
The paper provides a thorough survey of the integration of audio into large language models (LLMs), addressing four main areas: audio comprehension, audio generation, speech-based interaction, and audio-visual understanding. The methodology is primarily qualitative, synthesizing existing research rather than presenting new experimental results. The authors effectively categorize and analyze the advancements in these areas, highlighting how LLMs reshape audio perception and reasoning. However, the lack of original experimental work limits the methodological rigor typically expected in a research paper.
As a survey paper, it does not present new experiments or datasets. Instead, it reviews existing literature and identifies trends and gaps in current research. While this approach is valuable for synthesizing knowledge, it does not provide empirical validation of the discussed concepts. The paper could have benefited from case studies or examples of implementations to illustrate the discussed methodologies.
The paper lacks specific implementation details or code repositories that would allow for reproducibility of the findings. As it primarily reviews existing work, it does not present new methodologies that could be reproduced. The absence of a project URL or demo further limits the ability to verify the claims made in the survey.
The paper's primary limitation is its survey nature, which does not contribute new experimental findings or methodologies. Additionally, the broad scope may dilute the depth of analysis in specific areas. The authors identify challenges and future directions, but without empirical data, the proposed future work remains speculative.
The integration of audio into LLMs has significant implications for the development of more naturalistic and human-like AI systems. The paper highlights the potential for advancements in machine listening and speaking, which could enhance applications in various fields, including human-computer interaction, assistive technologies, and entertainment. The exploration of audio-visual understanding also suggests avenues for improving situational awareness in AI systems.