Personalizing Automatic Speech Recognition (ASR) for dysarthric speech is crucial but challenging because it typically requires training and storing individual user adapters. We propose a hybrid meta-training method for a single model that excels at zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). Measured by Word Error Rate (WER), the model achieves state-of-the-art results: 13.9% WER on Euphonia, surpassing speaker-independent baselines (17.5% WER) and rivaling user-specific personalized models. On SAP Test 1, its 5.3% WER significantly bests the 8% WER of even personalized adapters. We also demonstrate the importance of example curation: with an oracle text-similarity method, 5 curated examples achieve performance similar to 19 randomly selected ones, highlighting a key area for future efficiency gains. Finally, we conduct data ablations to measure the data efficiency of this approach. This work presents a practical, scalable, and personalized solution.
Primary: Google DeepMind
All Institutions: Google DeepMind
This paper presents a novel approach to dysarthric speech recognition through a hybrid meta-learning strategy, significantly advancing the state-of-the-art in personalized ASR systems. The methodology is innovative, and the results demonstrate substantial improvements, positioning the work as a meaningful contribution to the field of machine learning and accessibility.
The proposed methodology introduces a hybrid meta-training strategy that combines zero-shot and few-shot learning for dysarthric speech recognition. By utilizing in-context learning (ICL), the authors effectively eliminate the need for per-user model training, which is a significant advancement in the field. The use of a single model that can adapt to various users on-the-fly is innovative and practical, addressing the complexities of traditional ASR personalization methods. The exploration of example curation methods adds depth to the methodology, showcasing a thoughtful approach to data efficiency.
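To make the personalization mechanism concrete, the sketch below shows how few-shot ICL might be assembled at inference time: prior (audio, transcript) pairs from the same speaker are selected by a simple text-similarity score and interleaved ahead of the target audio. The `Example` container, the `token_overlap` scorer, and the prompt layout are illustrative assumptions, not the paper's implementation or its oracle curation method.

```python
from dataclasses import dataclass

@dataclass
class Example:
    audio_path: str   # prior utterance from the same speaker
    transcript: str   # its reference transcript

def token_overlap(a: str, b: str) -> float:
    """Crude text-similarity proxy: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def curate_examples(pool: list[Example], draft_transcript: str, k: int = 5) -> list[Example]:
    """Select the k examples whose transcripts are most similar to a draft
    (e.g., zero-shot) transcript of the target utterance."""
    ranked = sorted(pool, key=lambda ex: token_overlap(ex.transcript, draft_transcript), reverse=True)
    return ranked[:k]

def build_icl_prompt(examples: list[Example], target_audio: str) -> list:
    """Interleave (audio, transcript) pairs, then append the target audio."""
    parts = ["Transcribe the last audio clip. Earlier clips from the same speaker "
             "are provided with their transcripts."]
    for ex in examples:
        parts += [("audio", ex.audio_path), ex.transcript]
    parts += [("audio", target_audio), "Transcript:"]
    return parts
```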
The experiments are well-structured, utilizing two substantial datasets (Euphonia and SAP) to validate the proposed method. The results demonstrate a clear improvement in Word Error Rate (WER) compared to existing models, establishing new state-of-the-art benchmarks. The comparative analysis of different training strategies provides strong evidence for the effectiveness of the mixed-objective approach. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component in the model.
The paper provides a clear description of the datasets and evaluation metrics used, which is essential for reproducibility. However, the lack of a publicly available code repository or demo limits the ability for others to replicate the results precisely. The detailed methodology does allow for a reasonable attempt at reproduction, but the absence of shared resources is a drawback.
While the paper presents a robust solution, it does not address potential challenges in real-world deployment, such as the variability of dysarthric speech across different users and contexts. Additionally, the reliance on a large foundational model (Gemini 2.5 Flash) may limit accessibility for researchers without similar resources. The exploration of example curation methods, while promising, also raises questions about the practicality of implementing these strategies in real-time applications.
This research has significant implications for accessibility in technology, particularly for individuals with speech impairments. By improving ASR systems for dysarthric speech, the work can enhance communication tools for affected individuals, fostering greater inclusion in various domains. The findings could also inspire further research into personalized AI systems across different modalities, potentially benefiting a wider range of users with diverse needs.
State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling strategies that can both generalize and handle idiosyncrasy could be more effective for capturing atypical speech. To investigate this, we compare four strategies: (a) $\textit{normative}$ models trained on typical speech (no personalization), (b) $\textit{idiosyncratic}$ models completely personalized to individuals, (c) $\textit{dysarthric-normative}$ models trained on other dysarthric speakers, and (d) $\textit{dysarthric-idiosyncratic}$ models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than the idiosyncratic approach while requiring less than half as much personalized data (36.43 WER with 128 training examples vs. 36.99 with 256). Further, we found that tuning the speech encoder alone (as opposed to the LM decoder) yielded the best results, reducing word error rate from 71% to 32% on average. Our findings highlight the value of leveraging both normative (cross-speaker) and idiosyncratic (speaker-specific) patterns to improve ASR for underrepresented speech populations.
Primary: Vanderbilt University
All Institutions: Department of Computer Science, Stony Brook University; College of Connected Computing, Vanderbilt University
The main contribution of this paper is the introduction of a dysarthric-idiosyncratic modeling approach that effectively combines normative and personalized strategies to enhance ASR performance for individuals with dysarthria. This work not only advances the technical understanding of ASR in atypical speech contexts but also highlights the need for more inclusive and representative datasets in machine learning research.
The paper presents a systematic comparison of four modeling strategies for automatic speech recognition (ASR) tailored to dysarthric speech, which is a significant contribution to the field. The methodology is well-structured, employing a combination of normative and idiosyncratic modeling approaches. The use of transfer learning and fine-tuning techniques is appropriate given the limited data available for dysarthric speakers. However, while the methodology is sound, it could benefit from a more detailed explanation of the parameter-efficient strategies employed and their specific impacts on model performance.
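Since the abstract reports that tuning the speech encoder alone (rather than the LM decoder) worked best, a minimal sketch of that setup is shown below, assuming the Hugging Face Whisper implementation; the checkpoint size, learning rate, and any parameter-efficient variants the authors actually used are not specified here.

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Freeze all parameters, then unfreeze only the speech encoder.
for p in model.parameters():
    p.requires_grad = False
for p in model.model.encoder.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
# Training then proceeds as usual: model(input_features=..., labels=...) returns a loss
# whose gradients update encoder weights only.
```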
The experiments are thorough, utilizing a well-defined dataset (TORGO) and employing leave-one-out cross-validation to ensure robustness. The results demonstrate clear improvements in word error rates (WER) across different modeling strategies, particularly highlighting the effectiveness of the dysarthric-idiosyncratic model. However, the reliance on a small dataset limits the generalizability of the findings, and the paper could have included more extensive comparisons with existing state-of-the-art models.
The paper provides sufficient details regarding the experimental setup, including model architecture, training parameters, and evaluation metrics, which supports reproducibility. The availability of the GitHub repository further enhances the potential for other researchers to replicate the study. However, a more detailed description of the data preprocessing steps would improve clarity.
The study is limited by the small number of speakers in the TORGO dataset, which may not capture the full diversity of dysarthric speech. Additionally, the paper acknowledges that factors such as regional accents and dialects were not controlled, which could influence results. The authors also note the need for larger datasets to validate their findings, particularly in real-world applications.
This research has significant implications for improving ASR systems for individuals with dysarthria, a population often underserved by current technologies. By demonstrating that a hybrid approach can outperform purely personalized models, the findings could lead to more accessible and effective speech recognition tools in clinical settings. The study also emphasizes the importance of inclusive AI development, which is crucial for ensuring that technological advancements benefit all users, including those with disabilities.
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology, Department of Speech, Music and Hearing
VoXtream presents a pioneering approach to streaming TTS with ultra-low latency, combining innovative transformer architectures to achieve competitive performance. The paper's contributions are substantial, addressing a critical need in real-time speech synthesis and setting a new benchmark for future research in the field.
The methodology presented in VoXtream is innovative, utilizing a combination of autoregressive transformers to achieve low-latency streaming TTS. The architecture's design, which includes an incremental Phoneme Transformer, a Temporal Transformer, and a Depth Transformer, is well thought out and addresses the critical issue of initial latency in TTS systems. The use of dynamic look-ahead for phoneme processing is particularly noteworthy, as it allows for immediate speech output without waiting for the entire input, which is a significant advancement over existing models. The integration of these components into a cohesive framework demonstrates a solid understanding of the challenges in TTS and offers a practical solution.
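As a rough illustration of the scheduling idea only, the toy loop below emits audio for each phoneme as soon as it arrives, letting the model attend to up to a few already-buffered future phonemes without ever waiting for them; the phonemizer and synthesis step are hypothetical stubs, and the real VoXtream operates on transformer states and audio tokens rather than strings and byte frames.

```python
LOOKAHEAD = 2  # maximum number of future phonemes to attend to (illustrative value)

def phonemize(word: str) -> list[str]:
    return list(word)  # stand-in phonemizer for the sketch

def synthesize_step(context: list[str], index: int) -> bytes:
    return b"\x00" * 320  # placeholder for one chunk of generated audio

def stream_tts(words):
    phones: list[str] = []
    emitted = 0
    for w in words:  # words arrive incrementally from upstream (e.g., an LLM)
        phones.extend(phonemize(w))
        while emitted < len(phones):
            # Attend to whatever future phonemes are already available (capped at
            # LOOKAHEAD), but never block waiting for input that has not arrived.
            ctx = phones[: min(len(phones), emitted + 1 + LOOKAHEAD)]
            yield synthesize_step(ctx, emitted)
            emitted += 1

audio = b"".join(stream_tts(["hello", "streaming", "world"]))
```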
The experimental evaluation is robust, with comprehensive testing on established datasets such as SEED-TTS and LibriSpeech. The paper provides clear comparisons with multiple baseline models, showcasing VoXtream's performance in terms of intelligibility, naturalness, and latency. The results indicate that VoXtream not only meets but often exceeds the performance of larger models, despite being trained on a smaller dataset. The use of both objective metrics (WER, SPK-SIM, UTMOS) and subjective evaluations through user studies strengthens the credibility of the findings.
The paper includes sufficient implementation details, such as model architecture specifications, training procedures, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the results. The authors mention the use of specific datasets and training setups, which is helpful, but a direct link to the code would enhance reproducibility further.
One limitation of the study is the reliance on a mid-scale dataset (9k hours), which may restrict the model's generalizability compared to systems trained on larger datasets. Additionally, while the model achieves low initial latency, the paper does not extensively discuss the trade-offs in quality that may arise from such optimizations. The subjective evaluations, while positive, could benefit from a larger participant pool to ensure broader applicability of the results.
The implications of VoXtream are significant for real-time applications in conversational AI, voice assistants, and simultaneous translation systems. The ability to generate speech with minimal latency enhances user experience and engagement, making it a valuable contribution to the field of speech synthesis. The model's architecture could inspire further research into low-latency systems and their applications in various domains, potentially leading to advancements in human-computer interaction.
This paper introduces MR-CQTdiff, a novel neural-network architecture for diffusion-based audio generation that leverages a multi-resolution Constant-$Q$ Transform (CQT). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time-frequency resolution on an octave-by-octave basis. This design addresses the issue of low temporal resolution at lower frequencies, enabling more flexible and expressive audio generation. We conduct an evaluation using the Fréchet Audio Distance (FAD) metric across various architectures and two datasets. Experimental results demonstrate that MR-CQTdiff achieves state-of-the-art audio quality, outperforming competing architectures.
Funding: Volkswagen Foundation (Volkswagen Stiftung), Germany, under Grant no. 96 881. Both authors contributed equally to this work.
The paper introduces MR-CQTdiff, a novel architecture for diffusion-based audio generation that leverages a multi-resolution constant-Q transform to improve audio quality. The comprehensive analysis highlights its innovative methodology, robust experimental validation, and significant implications for the field of audio processing and generation.
The paper presents a well-structured methodology that introduces the MR-CQTdiff architecture, which innovatively employs a multi-resolution Constant-Q Transform (CQT) to enhance diffusion-based audio generation. The architecture's design addresses the critical trade-off between time and frequency resolution, particularly for low-frequency audio signals, by utilizing multiple parallel CQT filters. This approach allows for better capture of transient audio events and harmonically rich content, which is a significant improvement over existing methods. The use of a U-Net structure facilitates effective feature reuse and gradient flow, enhancing the model's training stability.
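To illustrate the octave-by-octave resolution idea in isolation, the snippet below computes a separate one-octave CQT per octave with a hop length that shrinks for higher octaves; the hop rule, bin counts, and the use of librosa's analysis-only CQT are illustrative assumptions and do not reproduce the paper's efficient, invertible framework.

```python
import numpy as np
import librosa

sr = 22050
y = librosa.chirp(fmin=32.7, fmax=4186.0, sr=sr, duration=2.0)  # C1 -> C8 sweep as test signal

fmin = 32.7          # C1
bins_per_octave = 12
n_octaves = 6

octave_cqts = []
for o in range(n_octaves):
    hop = max(1024 >> o, 32)  # coarser frames at low octaves, finer at high octaves
    C = librosa.cqt(
        y, sr=sr,
        fmin=fmin * (2.0 ** o),
        n_bins=bins_per_octave,
        bins_per_octave=bins_per_octave,
        hop_length=hop,
    )
    octave_cqts.append(np.abs(C))
    print(f"octave {o}: {C.shape[0]} bins x {C.shape[1]} frames, hop={hop}")
```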
The experimental evaluation is robust, utilizing two diverse datasets (FMA-Large and OpenSinger) to assess the performance of MR-CQTdiff against several strong baselines. The use of the Fréchet Audio Distance (FAD) metric provides a quantitative measure of audio quality, and the results demonstrate that MR-CQTdiff consistently outperforms other models, particularly in capturing transient details in vocal audio. The thoroughness of the experiments, including the comparison with latent diffusion models, adds credibility to the findings.
The paper provides sufficient implementation details, including the architecture specifications, training parameters, and dataset descriptions, which enhance reproducibility. The availability of the code on GitHub and the demo page with audio samples further supports this aspect, allowing other researchers to replicate the experiments and validate the results.
While the proposed architecture shows promising results, the paper acknowledges limitations in terms of computational efficiency compared to latent diffusion models. The focus on audio generation quality may lead to increased resource consumption, which could be a barrier for broader applications. Additionally, the evaluation primarily focuses on unconditional generation, and further exploration of conditional generation tasks could provide deeper insights into the model's capabilities.
The MR-CQTdiff architecture has significant potential applications in various audio generation tasks, including music synthesis, sound design, and audio restoration. By improving the quality of generated audio, this work could influence the development of more sophisticated audio generation tools and enhance user experiences in creative industries. The findings may also inspire further research into time-frequency representations in generative models, potentially leading to advancements in other domains of machine learning.
Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware mechanisms or the difficulty of modeling complex rhythmic patterns. Rhythmic information is crucial, as it defines structural similarity (e.g., tempo, BPM) and directly impacts the overall quality of the generated music. In this paper, we introduce Etude, a three-stage architecture consisting of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and applying a novel, simplified REMI-based tokenization, our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation through style injection. Subjective evaluations with human listeners show that Etude substantially outperforms prior models, achieving a quality level comparable to that of human composers.
The main contribution of this paper is the introduction of the Etude framework, a novel three-stage architecture for Automatic Piano Cover Generation that significantly enhances the quality and controllability of generated music. This work represents a substantial advancement in the field of music generation, addressing critical challenges in structural consistency and stylistic diversity through innovative methodologies and comprehensive evaluations.
The proposed methodology of Etude is well-structured and innovative, consisting of a three-stage architecture that effectively separates the extraction of musical features, the structuralization of rhythmic information, and the decoding of the final output. This modular approach addresses key challenges in Automatic Piano Cover Generation (APCG), particularly the need for structural consistency and stylistic control. The introduction of Tiny-REMI as a minimalistic token representation is a significant improvement over previous models, simplifying the learning task for the decoder. The use of a pre-trained Beat-Transformer for rhythmic analysis is also a notable enhancement, ensuring that the generated covers maintain a coherent rhythmic framework.
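The paper's Tiny-REMI vocabulary is not detailed in this summary, so the sketch below shows a generic, simplified REMI-style flattening of note events into Bar / Position / Pitch / Duration tokens, purely to illustrate what such a reduced token stream looks like; the exact token set and grid are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Note:
    bar: int        # bar index
    position: int   # onset position within the bar (e.g., 16th-note grid, 0-15)
    pitch: int      # MIDI pitch
    duration: int   # duration in grid steps

def to_remi_like_tokens(notes: list[Note]) -> list[str]:
    """Flatten notes into a REMI-style token stream: a Bar token whenever the bar
    changes, then Position/Pitch/Duration tokens per note."""
    tokens, current_bar = [], None
    for n in sorted(notes, key=lambda n: (n.bar, n.position, n.pitch)):
        if n.bar != current_bar:
            tokens.append("Bar")
            current_bar = n.bar
        tokens += [f"Position_{n.position}", f"Pitch_{n.pitch}", f"Duration_{n.duration}"]
    return tokens

print(to_remi_like_tokens([Note(0, 0, 60, 4), Note(0, 4, 64, 4), Note(1, 0, 67, 8)]))
```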
The experimental evaluation is comprehensive, utilizing both objective and subjective metrics to assess the performance of the Etude framework against several baseline models. The dataset of approximately 7,700 pop song and piano cover pairs is substantial, and the authors have taken care to ensure data quality through filtering and alignment methods. The results demonstrate that Etude significantly outperforms existing models in both objective metrics (WPD, RGC, IPE) and subjective evaluations (similarity, fluency, dynamic expression, overall quality), providing strong evidence for the effectiveness of the proposed approach.
The paper provides sufficient detail regarding the training process, model architecture, and evaluation metrics, which supports reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the results. The authors mention that all code and audio demonstrations are available on their project page, which is a positive aspect, but the absence of a GitHub link could hinder broader accessibility.
One identified limitation is the reliance on the performance of the front-end components, particularly the Beat-Detector and Extractor. The authors acknowledge that the framework's structural accuracy is constrained by the precision of the beat tracker and that the Extractor's flattening process may lead to information loss. This could affect the model's ability to capture the primary melody of the original song, resulting in incomplete melodic lines. Additionally, the subjective evaluation indicates that while the model performs well, it still falls short of human performance in certain aspects.
The potential applications of the Etude framework are significant, particularly in the realm of music generation and AI-assisted creativity. The ability to generate high-quality, stylistically diverse piano covers could enhance user engagement in music production and education. Furthermore, the framework's modular design allows for future extensions, such as integrating more advanced beat-tracking modules or exploring multi-stream extractors, which could further improve its capabilities.
The steered response power (SRP) method is one of the most popular approaches for acoustic source localization with microphone arrays. It is often based on simplifying acoustic assumptions, such as an omnidirectional sound source in the far field of the microphone array(s), free-field propagation, and spatially uncorrelated noise. In reality, however, there are many acoustic scenarios where such assumptions are violated. This paper proposes a generalization of the conventional SRP method that allows generic acoustic models to be applied for localization with arbitrary microphone constellations. These models may consider, for instance, level differences in distributed microphones, the directivity of sources and receivers, or acoustic shadowing effects. Moreover, measured acoustic transfer functions may also be applied as an acoustic model. We show that the delay-and-sum beamforming of the conventional SRP is not optimal for localization with generic acoustic models. To this end, we propose a generalized SRP beamforming criterion that considers generic acoustic models and spatially correlated noise, and derive an optimal SRP beamformer. Furthermore, we propose and analyze appropriate frequency weightings. Unlike the conventional SRP, the proposed method can jointly exploit observed level and time differences between the microphone signals to infer the source location. Realistic simulations of three different microphone setups with speech under various noise conditions indicate that the proposed method can significantly reduce the mean localization error compared to the conventional SRP; in particular, a reduction of more than 60% can be achieved in noisy conditions.
Primary: University of Oldenburg
All Institutions: University of Oldenburg, Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, 26129 Oldenburg, Germany; Audio AI R&D Department. This project has received funding from the SOUNDS European Training Network, a European Union Horizon 2020 research and innovation programme, under Marie Skłodowska-Curie grant agreement No. 956369.
The paper presents a novel approach to sound source localization by generalizing the steered response power method to accommodate complex acoustic environments. This advancement is crucial for enhancing the accuracy and robustness of localization systems in real-world applications.
The paper introduces a generalized steered response power (GSRP) method that enhances traditional SRP techniques by incorporating generic acoustic models and addressing limitations in the conventional SRP method. The authors provide a comprehensive mathematical framework that allows for the inclusion of various acoustic propagation models and noise characteristics, which is a significant advancement over previous methods that relied on oversimplified assumptions. The proposed MVCNR and MPCNR beamformers demonstrate a robust design that optimizes localization accuracy under diverse acoustic conditions. The methodology is well-structured, with clear derivations and justifications for the proposed approaches.
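For orientation, the conventional SRP that the paper generalizes can be written in its common SRP-PHAT form, scoring a candidate location $\mathbf{x}$ by delay-and-sum of phase-transformed cross-spectra (this is the textbook formulation, not the paper's generalized criterion):

$$
P(\mathbf{x}) = \sum_{m=1}^{M} \sum_{m'=1}^{M} \int \frac{X_m(\omega)\, X_{m'}^{*}(\omega)}{\left| X_m(\omega)\, X_{m'}^{*}(\omega) \right|}\; e^{\,j\omega\left(\tau_m(\mathbf{x}) - \tau_{m'}(\mathbf{x})\right)}\, d\omega, \qquad \hat{\mathbf{x}} = \arg\max_{\mathbf{x}} P(\mathbf{x}),
$$

where $X_m(\omega)$ is the short-time spectrum of microphone $m$ and $\tau_m(\mathbf{x})$ is the free-field propagation delay from $\mathbf{x}$ to microphone $m$. The proposed generalization replaces the free-field delay model implicit in the phase term with a generic acoustic model and additionally accounts for spatially correlated noise.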
The experimental validation is thorough, utilizing realistic simulations across three different microphone setups with varying noise conditions. The results indicate a significant reduction in localization error compared to conventional methods, particularly in challenging acoustic environments. The paper provides detailed descriptions of the experimental setups, including the generation of microphone signals and the evaluation metrics used. The performance of the proposed methods is convincingly demonstrated through comparative analysis against established techniques, showcasing their effectiveness in real-world scenarios.
The paper lacks specific implementation details or code availability, which could hinder reproducibility. While the authors describe the methodologies and experimental setups in detail, the absence of a publicly available code repository or supplementary materials limits the ability of other researchers to replicate the results. Providing a demo or project URL would enhance reproducibility and facilitate further research in this area.
One limitation of the proposed methods is their dependency on accurate acoustic models, which may not always be feasible in practical applications. The performance of the GSRP methods could be sensitive to model inaccuracies or assumptions regarding noise characteristics. Additionally, while the paper addresses various noise conditions, the generalizability of the findings to all acoustic environments remains to be thoroughly validated.
The advancements presented in this paper have significant implications for various applications, including teleconferencing, robotics, and autonomous systems. By improving sound source localization in complex acoustic environments, the proposed methods can enhance the performance of systems that rely on accurate spatial awareness, ultimately leading to better user experiences and more effective technological solutions. The work contributes to the ongoing development of more sophisticated audio processing techniques, which are increasingly relevant in today's technology-driven world.
The pipeline for multi-participant audiobook production primarily consists of three stages: script analysis, character voice timbre selection, and speech synthesis. Among these, script analysis can be automated with high accuracy using NLP models, whereas character voice timbre selection still relies on manual effort. Speech synthesis uses either manual dubbing or text-to-speech (TTS). While TTS boosts efficiency, it struggles with emotional expression, intonation control, and contextual scene adaptation. To address these challenges, we propose DeepDubbing, an end-to-end automated system for multi-participant audiobook production. The system comprises two main components: a Text-to-Timbre (TTT) model and a Context-Aware Instruct-TTS (CA-Instruct-TTS) model. The TTT model generates role-specific timbre embeddings conditioned on text descriptions. The CA-Instruct-TTS model synthesizes expressive speech by analyzing contextual dialogue and incorporating fine-grained emotional instructions. This system enables the automated generation of multi-participant audiobooks with both timbre-matched character voices and emotionally expressive narration, offering a novel solution for audiobook production.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications; Beijing University of Civil Engineering and Architecture; Tencent Music Entertainment Lyra Lab
The main contribution of this paper is the introduction of DeepDubbing, an end-to-end automated system for multi-participant audiobook production that combines innovative text-to-timbre generation and context-aware speech synthesis. This work represents a significant advancement in the field of audio synthesis, addressing critical challenges in emotional expressiveness and character voice differentiation, thereby paving the way for more immersive audiobook experiences.
The methodology presented in the paper is robust and innovative, leveraging a dual-component architecture that includes a Text-to-Timbre (TTT) model and a Context-Aware Instruct-TTS (CA-Instruct-TTS) model. The use of conditional flow matching for timbre generation is a significant advancement, allowing for more nuanced and contextually appropriate voice synthesis. The integration of large language models (LLMs) for both timbre description generation and emotional instruction extraction showcases a sophisticated approach to automating audiobook production. However, the paper could benefit from a more detailed explanation of the training processes and hyperparameter choices for the models.
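Since the TTT model is trained with conditional flow matching, a minimal, generic CFM training step (rectified-flow form) is sketched below with a placeholder velocity network; the embedding dimensions, conditioning interface, and network are illustrative assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Placeholder velocity predictor: (noisy embedding, time, text condition) -> velocity."""
    def __init__(self, dim=192, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + cond_dim, 512), nn.SiLU(), nn.Linear(512, dim)
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def cfm_loss(model, x1, cond):
    """x1: target timbre embeddings [B, D]; cond: text-description embeddings [B, C]."""
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.size(0), 1)      # uniform time in (0, 1)
    x_t = (1 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                 # constant velocity along that path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = VelocityNet()
loss = cfm_loss(model, torch.randn(8, 192), torch.randn(8, 512))
loss.backward()
```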
The experimental evaluation is comprehensive, utilizing a large-scale internal dataset and employing both subjective and objective metrics to assess the performance of the proposed models. The results indicate that the DeepDubbing system achieves high levels of naturalness and emotional expressiveness in synthesized speech, outperforming baseline methods. The use of a diverse set of evaluation metrics, including Character Matching Score and Mean Opinion Scores, adds credibility to the findings. However, the paper could improve by providing more comparative analysis against a wider range of existing systems.
The paper mentions the release of the BookVoice-50h dataset and provides a demo URL, which enhances reproducibility. However, specific implementation details, such as the exact configurations and training procedures for the models, are not thoroughly documented, making it challenging for other researchers to replicate the results without further guidance.
One notable limitation is the TTT model's struggle with generating child-like voices due to the lack of authentic child speech data in the training set. This limitation could hinder the system's applicability in scenarios requiring diverse character voices. Additionally, while the paper addresses the emotional expressiveness of the CA-Instruct-TTS model, it does not explore the potential biases that might arise from the training data or the implications of using LLMs in this context.
The proposed DeepDubbing system has significant potential applications in the audiobook industry, particularly in automating the production of multi-participant audiobooks, which could reduce costs and production times. The ability to generate emotionally expressive and contextually aware speech could enhance user engagement and experience. Furthermore, the methodologies developed could be adapted for other applications in voice synthesis, such as gaming, virtual reality, and interactive storytelling.
Deep learning-based Sound Event Localization and Detection (SELD) systems degrade significantly on real-world, long-tailed datasets. Standard regression losses bias learning toward frequent classes, causing rare events to be systematically under-recognized. To address this challenge, we introduce MAGENTA (Magnitude And Geometry-ENhanced Training Approach), a unified loss function that counteracts this bias within a physically interpretable vector space. MAGENTA geometrically decomposes the regression error into radial and angular components, enabling targeted, rarity-aware penalties and strengthened directional modeling. Empirically, MAGENTA substantially improves SELD performance on imbalanced real-world data, providing a principled foundation for a new class of geometry-aware SELD objectives. Code is available at: https://github.com/itsjunwei/MAGENTA_ICASSP
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, School of Electrical and Electronic Engineering, Smart Nation TRANS Lab. This research is supported by the Singapore Ministry of Education Academic Research Fund Tier 2, under research grant MOE-T2EP20224-0010.
The main contribution of this paper is the introduction of MAGENTA, a geometry- and rarity-aware loss function for Sound Event Localization and Detection that effectively addresses the challenges posed by long-tailed datasets. This work represents a significant advancement in the field, providing a principled and effective solution that enhances the detection of rare acoustic events while maintaining robust localization performance.
The proposed MAGENTA framework introduces a novel geometric decomposition of regression errors in Sound Event Localization and Detection (SELD), specifically addressing the challenges posed by long-tailed datasets. By separating the error into radial and angular components, the authors provide a targeted approach to mitigate detection timidity for rare classes. This methodology is well-grounded in the physical interpretation of the problem and is a significant advancement over traditional loss functions like Mean Squared Error (MSE), which do not account for the unique geometry of the ACCDOA representation. The modular design of the loss function allows for fine-tuning and flexibility, making it a robust solution for SELD tasks.
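Following the description above, a compact sketch of the radial/angular decomposition with a per-class rarity weight is given below; the weighting scheme, balance terms, and tensor layout are illustrative assumptions, not the released MAGENTA loss.

```python
import torch

def radial_angular_loss(pred, target, rarity_w, eps=1e-8, lam_rad=1.0, lam_ang=1.0):
    """pred, target: [B, C, 3] ACCDOA vectors (class activity encoded as vector length,
    direction as orientation). rarity_w: [C] per-class weights, larger for rare classes."""
    pred_norm = pred.norm(dim=-1)                      # [B, C] predicted activity magnitude
    tgt_norm = target.norm(dim=-1)                     # [B, C]
    radial = (pred_norm - tgt_norm) ** 2               # magnitude (detection) error

    cos = torch.nn.functional.cosine_similarity(pred, target, dim=-1, eps=eps)
    angular = (1.0 - cos) * (tgt_norm > eps).float()   # direction error, only where a source is active

    per_class = lam_rad * radial + lam_ang * angular   # [B, C]
    return (per_class * rarity_w.view(1, -1)).mean()

loss = radial_angular_loss(torch.randn(4, 13, 3), torch.randn(4, 13, 3), torch.rand(13))
```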
The experiments are rigorously designed, utilizing the STARSS23 dataset, which is representative of real-world scenarios and characterized by a significant class imbalance. The authors provide a comprehensive evaluation of various loss function configurations, demonstrating the effectiveness of MAGENTA through empirical results that show substantial improvements in SELD performance metrics. The results are well-presented, with clear comparisons against baseline methods, and the statistical significance of improvements is implied through the structured experimentation. However, the paper lacks detailed statistical analysis of the results, which could strengthen the claims made.
The paper includes sufficient implementation details, including the architecture used (SELDNet), training parameters, and evaluation metrics. The availability of the code on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings. However, the paper could benefit from additional documentation or examples on how to run the code effectively.
One limitation is the reliance on a single dataset (STARSS23) for evaluation, which may not fully capture the diversity of real-world acoustic environments. Additionally, while the proposed method shows improvements, the potential for increased false positives due to heightened sensitivity in rare class detection is noted but not quantitatively analyzed. The authors also mention future work on adaptive priors, indicating that the current approach may not fully address all aspects of class imbalance.
The MAGENTA framework has significant implications for applications in audio surveillance, smart environments, and assistive technologies, where accurate sound event detection and localization are critical. By improving the recognition of rare sound events, this work could enhance situational awareness in various domains, including public safety and human-computer interaction. The methodology also sets a precedent for future research in long-tailed learning and geometry-aware training approaches.
While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind that of large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale chain-of-thought audio data to teach LALMs stepwise reasoning. To circumvent this data and modality gap, we present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student on the same audio-visual question answering (AVQA) dataset. SightSound-R1 consists of three core steps: (i) test-time scaling to generate audio-focused chains of thought (CoT) from an LVLM teacher, (ii) audio-grounded validation to filter hallucinations, and (iii) a distillation pipeline with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) for the LALM student. Results show that SightSound-R1 improves LALM reasoning performance both on the in-domain AVQA test set and on unseen auditory scenes and questions, outperforming both pretrained and label-only distilled baselines. We thus conclude that vision reasoning can be effectively transferred to audio models and scaled with abundant audio-visual data.
Primary: University of Washington
All Institutions: Columbia University, University of Washington
The main contribution of this paper is the introduction of SightSound-R1, a novel framework for cross-modal reasoning distillation that enhances the reasoning capabilities of audio-language models by leveraging the strengths of vision-language models. This work represents a significant step forward in bridging the modality gap in multimodal AI systems, with the potential for broad applications in various fields.
The proposed methodology of SightSound-R1 is innovative, leveraging a cross-modal distillation framework that effectively bridges the reasoning capabilities between LVLMs and LALMs. The three-step process—test-time scaling, audio-grounded validation, and a distillation pipeline—demonstrates a thoughtful approach to addressing the identified gap in reasoning capabilities. The use of self-consistency to generate diverse reasoning traces and the incorporation of a lightweight audio-grounded fact verification step are particularly noteworthy. However, the methodology could benefit from a more detailed explanation of the underlying assumptions and potential biases in the audio-grounded validation process.
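As a toy illustration of the trace-selection step, the snippet below keeps only teacher chains of thought whose final answer agrees with the gold label (or, failing that, with the self-consistent majority answer); the audio-grounded hallucination check described in the paper would run as a separate filter and is not modeled here.

```python
from collections import Counter

def filter_cot_traces(traces, gold_answer=None):
    """traces: list of (chain_of_thought, final_answer) pairs sampled from the LVLM teacher.
    Keep traces whose answer matches the gold label if available, otherwise the
    majority-vote (self-consistent) answer."""
    answers = [a for _, a in traces]
    reference = gold_answer if gold_answer is not None else Counter(answers).most_common(1)[0][0]
    return [(cot, a) for cot, a in traces if a == reference]

kept = filter_cot_traces([("steady rhythmic barking ... so it's a dog", "dog"),
                          ("I hear an engine revving", "car"),
                          ("sharp barks near the camera", "dog")])
print(len(kept), "traces kept for SFT")
```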
The experimental evaluation is robust, utilizing multiple datasets (AVQA, MMAU, and MUSIC-AVQA) to validate the effectiveness of the proposed framework. The results indicate significant improvements in LALM reasoning performance, particularly in sound tasks, which supports the hypothesis that reasoning can be effectively transferred from LVLMs. The comparative analysis against pretrained and label-only distilled baselines adds credibility to the findings. However, the paper could improve by providing more detailed statistical analysis and significance testing for the reported results.
The implementation details are described with sufficient clarity, including the use of specific models, training parameters, and evaluation metrics. However, the absence of a public code repository or supplementary materials limits the reproducibility of the results. Future work should consider making the code and trained models available to facilitate further research and validation of the findings.
One limitation identified is the potential for hallucinations in the reasoning generated by the LVLM teacher, which may mislead the LALM student during training. Additionally, the performance drop in certain categories (Speech and Music) suggests that the framework may not generalize equally across all audio types, indicating a need for further refinement and integration with LALM perception capabilities.
The implications of this research are significant, as it addresses a critical gap in multimodal reasoning capabilities, particularly in the audio domain. By enhancing LALMs' reasoning through cross-modal distillation, the framework has the potential to improve applications in audio understanding, accessibility technologies, and interactive AI systems. The approach could pave the way for more sophisticated audio-language models that can reason about complex soundscapes, ultimately contributing to advancements in human-computer interaction and multimedia content analysis.
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.
Primary: Fudan University
All Institutions: Shanghai Key Laboratory of Intelligent Information Processing, Central Conservatory of Music, Department of Music AI and Music IT, Fudan University, School of Computer Science and Technology
The main contribution of this paper is the introduction of the TISDiSS framework, which effectively balances performance and computational efficiency in discriminative source separation tasks. This work presents a significant advancement in the field, particularly for applications requiring low-latency processing, while also providing a solid foundation for future research in scalable audio processing methodologies.
The proposed TISDiSS framework integrates several innovative components, including early-split multi-loss supervision and shared-parameter design, which are well-justified in the context of improving source separation tasks. The dynamic inference repetitions allow for a flexible trade-off between speed and performance, which is particularly relevant for real-time applications. However, while the methodology is robust, the paper could benefit from a clearer explanation of how these components interact and their specific contributions to the overall performance improvement.
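To make the shared-parameter, repetition-scalable idea concrete, the skeleton below applies one shared refinement block a configurable number of times and emits an estimate after every pass, so that all passes can be supervised (early-split multi-loss) while inference depth stays adjustable; the encoder/decoder, block type, and sizes are placeholders, not the TISDiSS architecture.

```python
import torch
import torch.nn as nn

class SharedIterativeSeparator(nn.Module):
    """Illustrative skeleton: one shared refinement block applied repeatedly; every
    pass can emit an estimate for early-split multi-loss supervision."""
    def __init__(self, dim=64, n_src=2):
        super().__init__()
        self.encode = nn.Conv1d(1, dim, kernel_size=16, stride=8)
        self.shared_block = nn.GRU(dim, dim, batch_first=True)   # parameters reused every pass
        self.mask_head = nn.Conv1d(dim, dim * n_src, kernel_size=1)
        self.decode = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)
        self.n_src = n_src

    def forward(self, mix, repetitions=4):
        feats = self.encode(mix)                       # [B, D, T']
        outputs, h = [], feats
        for _ in range(repetitions):                   # inference-time scalable depth
            h, _ = self.shared_block(h.transpose(1, 2))
            h = h.transpose(1, 2)
            masks = self.mask_head(h).view(mix.size(0), self.n_src, -1, h.size(-1)).sigmoid()
            est = torch.stack(
                [self.decode(masks[:, s] * feats) for s in range(self.n_src)], dim=1
            )
            outputs.append(est)                        # intermediate estimate for multi-loss training
        return outputs                                 # supervise all passes, deploy with any depth

model = SharedIterativeSeparator()
ests = model(torch.randn(2, 1, 8000), repetitions=2)
print(len(ests), ests[-1].shape)
```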
The experiments are conducted on standard speech separation benchmarks, showcasing state-of-the-art performance with a reduced parameter count. The results are compelling and demonstrate the effectiveness of the TISDiSS framework. However, the paper lacks a comprehensive comparison with a broader range of existing methods, which could further validate the claimed advantages.
The authors provide a GitHub repository for code access, which is a positive aspect for reproducibility. However, the paper would benefit from more detailed documentation regarding the experimental setup, hyperparameters, and specific configurations used in the experiments to facilitate easier reproduction by other researchers.
One limitation is the potential overfitting to the specific benchmarks used, as the paper does not explore the generalizability of the TISDiSS framework across diverse datasets or tasks beyond speech separation. Additionally, the reliance on dynamic inference repetitions may introduce complexity in deployment, which could be a barrier for practical applications.
The TISDiSS framework has significant implications for real-time audio processing applications, such as virtual assistants and music production tools, where efficient source separation is crucial. By enabling scalable performance adjustments, it opens avenues for further research into adaptive models that can cater to varying computational resources.
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology, Department of Speech, Music and Hearing
VoXtream presents a pioneering approach to streaming TTS with ultra-low latency, combining innovative transformer architectures to achieve competitive performance. The paper's contributions are substantial, addressing a critical need in real-time speech synthesis and setting a new benchmark for future research in the field.
The methodology presented in VoXtream is innovative, utilizing a combination of autoregressive transformers to achieve low-latency streaming TTS. The architecture's design, which includes an incremental Phoneme Transformer, a Temporal Transformer, and a Depth Transformer, is well thought out and addresses the critical issue of initial latency in TTS systems. The use of dynamic look-ahead for phoneme processing is particularly noteworthy, as it allows for immediate speech output without waiting for the entire input, which is a significant advancement over existing models. The integration of these components into a cohesive framework demonstrates a solid understanding of the challenges in TTS and offers a practical solution.
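The following toy Python sketch illustrates the general idea of first-word onset with a bounded look-ahead: audio for a phoneme is emitted as soon as a small window of future phonemes is visible, rather than after the whole sentence. It is not VoXtream's implementation; `synthesize_frames` is a hypothetical stand-in for the incremental transformer stack and codec decoder.

```python
from collections import deque

def stream_tts(phoneme_stream, lookahead=1):
    """Toy streaming synthesis loop with a dynamic look-ahead (hypothetical API)."""
    buffer = deque()
    for ph in phoneme_stream:
        buffer.append(ph)
        # Emit as soon as the current phoneme plus its look-ahead window is known.
        while len(buffer) > lookahead:
            current = buffer.popleft()
            context = list(buffer)[:lookahead]
            yield synthesize_frames(current, context)
    while buffer:                                       # flush remaining phonemes
        yield synthesize_frames(buffer.popleft(), list(buffer)[:lookahead])

def synthesize_frames(phoneme, lookahead_context):
    # Stand-in for the incremental transformer + audio-token decoding.
    return f"frames({phoneme}|ctx={lookahead_context})"

if __name__ == "__main__":
    for frames in stream_tts(iter("HELLO"), lookahead=1):
        print(frames)
```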
The experimental evaluation is robust, with comprehensive testing on established datasets such as SEED-TTS and LibriSpeech. The paper provides clear comparisons with multiple baseline models, showcasing VoXtream's performance in terms of intelligibility, naturalness, and latency. The results indicate that VoXtream not only meets but often exceeds the performance of larger models, despite being trained on a smaller dataset. The use of both objective metrics (WER, SPK-SIM, UTMOS) and subjective evaluations through user studies strengthens the credibility of the findings.
The paper includes sufficient implementation details, such as model architecture specifications, training procedures, and evaluation metrics, which facilitate reproducibility. The abstract points to a demo and code page, and the authors describe the specific datasets and training setups used, which is helpful; ensuring the code repository linked there remains publicly accessible would further ease exact replication.
One limitation of the study is the reliance on a mid-scale dataset (9k hours), which may restrict the model's generalizability compared to systems trained on larger datasets. Additionally, while the model achieves low initial latency, the paper does not extensively discuss the trade-offs in quality that may arise from such optimizations. The subjective evaluations, while positive, could benefit from a larger participant pool to ensure broader applicability of the results.
The implications of VoXtream are significant for real-time applications in conversational AI, voice assistants, and simultaneous translation systems. The ability to generate speech with minimal latency enhances user experience and engagement, making it a valuable contribution to the field of speech synthesis. The model's architecture could inspire further research into low-latency systems and their applications in various domains, potentially leading to advancements in human-computer interaction.
In speech enhancement, knowledge distillation (KD) compresses models by transferring a high-capacity teacher's knowledge to a compact student. However, conventional KD methods train the student to mimic the teacher's output entirely, forcing the student to imitate regions where the teacher performs poorly and applying distillation even in regions where the student already performs well, which yields only marginal gains. We propose Distilling Selective Patches (DISPatch), a KD framework for speech enhancement that applies the distillation loss to spectrogram patches where the teacher outperforms the student, as determined by a Knowledge Gap Score. This approach guides optimization toward areas with the most significant potential for student improvement while minimizing the influence of regions where the teacher may provide unreliable instruction. Furthermore, we introduce Multi-Scale Selective Patches (MSSP), a frequency-dependent method that uses different patch sizes across low- and high-frequency bands to account for spectral heterogeneity. We incorporate DISPatch into conventional KD methods and observe consistent gains in compact students. Moreover, integrating DISPatch and MSSP into a state-of-the-art frequency-dependent KD method considerably improves performance across all metrics.
Primary: School of Electrical Engineering
All Institutions: School of Electrical Engineering
The main contribution of this paper is the introduction of the DISPatch framework, which innovatively applies selective knowledge distillation in speech enhancement, leading to significant performance improvements while addressing the limitations of traditional methods. This work represents a meaningful advancement in the field, with potential applications extending beyond speech enhancement to various machine learning tasks.
The proposed DISPatch framework introduces a novel approach to knowledge distillation in speech enhancement by selectively applying distillation losses to spectrogram patches where the teacher model outperforms the student. This is quantified using a Knowledge Gap Score (KGS), which is a significant advancement over traditional methods that indiscriminately apply distillation across all output regions. The introduction of Multi-Scale Selective Patches (MSSP) further enhances the methodology by adapting patch sizes based on frequency characteristics, addressing spectral heterogeneity effectively. The methodology is well-structured and clearly articulated, demonstrating a thoughtful integration of existing techniques with innovative modifications.
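A minimal NumPy sketch of the selective-patch idea, under the assumption that the Knowledge Gap Score compares per-patch reconstruction errors of teacher and student against the clean target; the paper's exact score definition, patch handling, and loss weighting may differ.

```python
import numpy as np

def dispatch_loss(student, teacher, target, patch=(4, 4)):
    """Toy patch-selective distillation: distil only where the teacher's error
    is lower than the student's (positive knowledge gap)."""
    ph, pw = patch
    F, T = target.shape
    sel_loss, n_sel = 0.0, 0
    for i in range(0, F - ph + 1, ph):
        for j in range(0, T - pw + 1, pw):
            s = student[i:i+ph, j:j+pw]
            t = teacher[i:i+ph, j:j+pw]
            y = target[i:i+ph, j:j+pw]
            gap = np.mean((s - y) ** 2) - np.mean((t - y) ** 2)  # assumed gap score
            if gap > 0:                       # teacher is better here -> distil
                sel_loss += np.mean((s - t) ** 2)
                n_sel += 1
    return sel_loss / max(n_sel, 1)

rng = np.random.default_rng(0)
y = rng.normal(size=(16, 32))
print(dispatch_loss(y + 0.5 * rng.normal(size=y.shape),
                    y + 0.1 * rng.normal(size=y.shape), y))
```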
The experiments are comprehensive, utilizing well-established datasets such as DNS2020 and VoiceBank+DEMAND, which provide a robust basis for evaluating the proposed method. The results indicate consistent improvements across various metrics when DISPatch is applied, particularly in conjunction with DFKD and MSSP. The ablation studies effectively demonstrate the importance of the KGS in selecting informative patches, reinforcing the method's validity. However, the paper could benefit from more extensive comparisons with a broader range of existing methods to contextualize the performance gains.
The implementation details are sufficiently detailed, including model configurations, training setups, and hyperparameters. The paper provides a GitHub link for accessing the code, which is crucial for reproducibility. However, the absence of a clear description of the environment and dependencies required for running the code could pose challenges for some researchers.
While the methodology shows promise, it may be limited by the assumptions made regarding the teacher model's superiority. If a teacher model is not adequately trained or is flawed, the selective distillation might not yield the expected benefits. Additionally, the paper does not explore the scalability of the approach to larger datasets or more complex models, which could be a potential area for future research.
The DISPatch framework has significant implications for real-world applications in speech enhancement, particularly in resource-constrained environments where computational efficiency is paramount. By improving the performance of compact models, this research could facilitate the deployment of advanced speech processing technologies in mobile devices and other low-power applications. The principles established in this work may also be applicable to other domains within machine learning, such as image processing and natural language processing, where selective learning could enhance model performance.
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, thereby minimizing parameter exchange. To harness heterogeneity, our aggregation method, inspired by collaborative filtering, is introduced to create custom models for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
The main contribution of this paper is the introduction of Fed-PISA, a novel federated learning framework for voice cloning that effectively balances personalization and communication efficiency through a disentangled adaptation mechanism and personalized aggregation strategy. This work significantly advances the field of federated TTS systems, addressing key challenges in personalization and communication costs while demonstrating strong empirical results.
The proposed methodology of Fed-PISA is innovative in its use of a disentangled Low-Rank Adaptation (LoRA) mechanism to separate speaker timbre from stylistic features, allowing for efficient federated learning without compromising personalization. The introduction of a personalized aggregation strategy based on collaborative filtering is a significant advancement, enabling the model to leverage stylistic similarities among clients effectively. The detailed description of the LoRA parameterization and the client-server interaction provides a clear understanding of the framework's operational dynamics.
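A schematic Python sketch of the communication split and the similarity-weighted aggregation; the parameter-name convention (`id_lora.*`, `style_lora.*`), the scalar stand-ins for tensors, and the normalization are illustrative assumptions, not the paper's implementation.

```python
def split_update(adapter_state):
    """Partition a client's adapter parameters for federated exchange:
    the private ID-LoRA stays local, only the style-LoRA is uploaded."""
    upload = {k: v for k, v in adapter_state.items() if k.startswith("style_lora.")}
    keep_local = {k: v for k, v in adapter_state.items() if k.startswith("id_lora.")}
    return upload, keep_local

def similarity_weighted_aggregate(style_updates, sims):
    """Build a custom style-LoRA for one client as a similarity-weighted
    average of its peers' uploads (collaborative-filtering-style aggregation)."""
    total = sum(sims.values())
    agg = {}
    for client, update in style_updates.items():
        w = sims[client] / total
        for k, v in update.items():
            agg[k] = agg.get(k, 0.0) + w * v
    return agg

# usage with toy scalars standing in for LoRA weight tensors
updates = {"c1": {"style_lora.w": 1.0}, "c2": {"style_lora.w": 3.0}}
print(similarity_weighted_aggregate(updates, {"c1": 0.8, "c2": 0.2}))
```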
The experiments are robust, utilizing four public datasets with emotion annotations to evaluate the effectiveness of Fed-PISA against various baselines, including both federated and non-federated methods. The results demonstrate significant improvements in style expressivity, speaker similarity, and naturalness, with detailed metrics reported. The inclusion of ablation studies strengthens the findings, confirming the necessity of the proposed components.
The paper provides sufficient implementation details, including the architecture, training parameters, and evaluation metrics, which facilitates reproducibility. The availability of a demo page with audio samples further aids in understanding the practical implications of the research.
While the approach shows promise, it may still be limited by the reliance on the quality of the datasets used and the inherent challenges of federated learning, such as variability in client data distribution. Additionally, the communication costs, while minimized, could still be a concern in highly distributed environments.
The implications of this work extend to various applications in personalized speech synthesis, voice assistants, and accessibility technologies. By enabling effective voice cloning with privacy preservation, Fed-PISA could enhance user experiences in numerous domains, including entertainment, education, and assistive technologies.
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
Primary: Concordia University
All Institutions: Concordia University, Mila-Quebec AI Institute
The main contribution of this paper is the development of FocalCodec-Stream, a novel streaming low-bitrate speech codec that effectively balances reconstruction quality, semantic preservation, and latency, thereby advancing the state of the art in neural audio codecs. The comprehensive analysis of the technical contributions, methodology, and experimental results highlights its significance in addressing real-time audio processing challenges.
The methodology presented in this paper is robust and innovative, particularly in its use of multi-stage causal distillation to adapt the WavLM architecture for streaming applications. The introduction of a lightweight refiner module to enhance audio quality under latency constraints is a significant contribution, as it addresses a critical challenge in low-latency audio processing. The architectural modifications, such as the use of causal convolutions and sliding window attention, are well-justified and effectively enable the codec to maintain performance while achieving streamability. The paper also provides a clear and structured approach to the codec design, which is commendable.
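Two ingredients such a streamable encoder typically relies on are causal convolutions and a bounded attention window. The PyTorch sketch below shows both in isolation; the layer sizes, window length, and the absence of any distillation machinery are assumptions for illustration, not FocalCodec-Stream's code.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Left-pad so the output at time t depends only on inputs <= t."""
    def __init__(self, ch_in, ch_out, kernel, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(ch_in, ch_out, kernel, dilation=dilation)

    def forward(self, x):                       # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

def sliding_window_mask(T, window):
    """Attention mask allowing each frame to see only the last `window` frames."""
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    return (j <= i) & (j > i - window)          # (T, T) boolean mask

x = torch.randn(1, 16, 100)
print(CausalConv1d(16, 32, kernel=5)(x).shape)  # torch.Size([1, 32, 100])
print(sliding_window_mask(6, window=3).int())
```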
The experimental evaluation is thorough, comparing FocalCodec-Stream against several existing streaming codecs across multiple metrics, including speech resynthesis, voice conversion, and downstream task performance. The results demonstrate that FocalCodec-Stream consistently outperforms its competitors in terms of intelligibility and speaker fidelity, even at lower bitrates. The use of diverse datasets, such as LibriSpeech and Libri-Light, adds credibility to the findings. The ablation studies further substantiate the importance of the refiner and the multi-stage training approach, providing a comprehensive understanding of the model's performance.
The paper mentions that code and checkpoints will be made available on GitHub, which is a positive aspect for reproducibility. However, while the implementation details are described, the paper could benefit from more explicit guidance on hyperparameter settings and training procedures to facilitate easier replication of the results by other researchers.
One limitation noted in the paper is the performance gap between FocalCodec-Stream and the full-context FocalCodec, particularly at lower bitrates. This is expected due to the stricter constraints imposed by real-time streaming. Additionally, the paper does not address potential challenges in scaling the model to larger datasets or the implications of deploying such a codec in resource-constrained environments.
The potential applications of FocalCodec-Stream are significant, particularly in real-time speech applications such as virtual assistants, telecommunication, and interactive dialogue systems. By achieving low-latency, high-quality audio coding, this work could enhance user experiences in various audio-related technologies, making it a valuable contribution to the field of machine learning and audio processing.
The spatial semantic segmentation task focuses on separating and classifying sound objects from multichannel signals. To achieve two different goals, conventional methods fine-tune a large classification model cascaded with the separation model and inject classified labels as separation clues for the next iteration step. However, such integration is not ideal, in that fine-tuning over a smaller dataset loses the diversity of large classification models, features from the source separation model are different from the inputs of the pretrained classifier, and injected one-hot class labels lack semantic depth, often leading to error propagation. To resolve these issues, we propose a Dual-Path Classifier (DPC) architecture that combines object features from a source separation model with semantic representations acquired from a pretrained classification model without fine-tuning. We also introduce a Semantic Clue Encoder (SCE) that enriches the semantic depth of injected clues. Our system achieves a state-of-the-art 11.19 dB CA-SDRi and enhanced semantic fidelity on the DCASE 2025 task4 evaluation set, surpassing the top-rank performance of 11.00 dB. These results highlight the effectiveness of integrating separator-derived features and rich semantic clues.
Primary: School of Electrical Engineering
All Institutions: School of Electrical Engineering
The main contribution of this paper is the introduction of a novel Dual-Path Classifier and Semantic Clue Encoder that significantly enhance sound separation and classification performance. The methodology effectively addresses key limitations of existing approaches, leading to improved accuracy and robustness in audio processing tasks.
The proposed methodology introduces a Dual-Path Classifier (DPC) and a Semantic Clue Encoder (SCE) to address the challenges of sound separation and classification. The DPC architecture effectively combines object features from a source separation model with semantic representations from a pretrained classifier without fine-tuning, which is a significant improvement over conventional methods. The SCE enhances the semantic depth of the injected clues, mitigating the limitations of one-hot encoding. The architecture's design, which includes a dual-path CRNN and a robust fusion mechanism, demonstrates a thoughtful approach to leveraging existing models while preserving feature diversity and richness. However, the paper could benefit from a clearer explanation of the integration process between the DPC and SCE, as well as more detailed descriptions of the training process.
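As an illustration of why a richer clue can help, the sketch below replaces a one-hot label with a probability-weighted mixture of learned class embeddings; the 18-class setting loosely mirrors the DCASE task, but the dimensions and the exact conditioning interface are assumptions rather than the paper's SCE.

```python
import torch
import torch.nn as nn

class SemanticClueEncoder(nn.Module):
    """Sketch of turning classifier outputs into a richer separation clue than
    a one-hot label: a probability-weighted mix of learned class embeddings."""
    def __init__(self, n_classes=18, dim=128):
        super().__init__()
        self.class_emb = nn.Embedding(n_classes, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, class_probs):                  # (batch, n_classes)
        clue = class_probs @ self.class_emb.weight   # soft mixture keeps uncertainty
        return self.proj(clue)                       # (batch, dim) clue for the separator

probs = torch.softmax(torch.randn(2, 18), dim=-1)
print(SemanticClueEncoder()(probs).shape)            # torch.Size([2, 128])
```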
The experiments are well-structured, utilizing the DCASE 2025 task4 challenge as a benchmark for evaluation. The results indicate a clear performance improvement over previous systems, with a state-of-the-art CA-SDRi score of 11.19 dB. The paper provides comprehensive comparisons across different stages of the proposed framework, showcasing the effectiveness of both the DPC and SCE. However, the evaluation could be strengthened by including more diverse datasets and additional performance metrics to provide a broader perspective on the model's capabilities.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of pretrained weights and specific training configurations, it does not provide the exact architecture details, hyperparameters, or code repositories. Including this information would greatly enhance the ability of other researchers to replicate the results.
The paper identifies several limitations in existing methods, such as the loss of diversity in fine-tuning and the inadequacy of one-hot class labels. However, it does not thoroughly address potential weaknesses in the proposed approach, such as the reliance on pretrained models and the risk of overfitting to the training data. Additionally, the evaluation is limited to a specific challenge dataset, which may not fully represent real-world scenarios.
The proposed methods have significant implications for audio processing applications, particularly in environments where sound separation and classification are critical, such as in assistive technologies, smart environments, and multimedia content creation. By improving the accuracy and robustness of sound separation systems, this research could enhance user experiences in various audio-related applications.
Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs' reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: (1) convergence instability caused by ambiguous emotional boundaries and (2) limited reasoning ability when using relatively small models (e.g., 7B-parameter architectures). To overcome these limitations, we introduce EMO-RL, a novel framework incorporating reinforcement learning with two key innovations: Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). Built upon pretrained LALMs, our method employs group-relative policy optimization with emotion constraints. Comprehensive experiments demonstrate that our EMO-RL training strategies significantly enhance the emotional reasoning capabilities of LALMs, attaining state-of-the-art results on both the MELD and IEMOCAP datasets, while cross-dataset experiments demonstrate strong generalization.
The main contribution of this paper is the introduction of the EMO-RL framework, which enhances the emotional reasoning capabilities of large audio-language models for speech emotion recognition through innovative reinforcement learning techniques. This work represents a meaningful advancement in the field, addressing critical challenges in emotion recognition and setting a foundation for future research in multi-modal emotion detection systems.
The proposed EMO-RL framework effectively integrates reinforcement learning with emotion-specific strategies, namely Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). This innovative approach addresses the challenges of convergence instability and limited reasoning capabilities in speech emotion recognition tasks. The methodology is well-structured, providing a clear transformation of the SER problem into a regression framework that accommodates emotional nuances. However, the reliance on psychological models like Plutchik's wheel for reward structuring, while beneficial, may introduce biases based on the chosen emotional framework.
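The reward shaping can be pictured with a toy sketch: a near-miss prediction earns partial credit according to an emotion-similarity table, and a small bonus rewards well-formed structured reasoning. The similarity values, tag format, and 0.2/0.8 split below are illustrative assumptions, not the paper's actual reward.

```python
import re

# Hypothetical similarity values; the paper derives its weighting from an
# emotion model such as Plutchik's wheel, not from this toy table.
SIM = {
    ("happy", "surprise"): 0.6, ("happy", "sad"): 0.1,
    ("sad", "angry"): 0.4, ("sad", "happy"): 0.1,
}

def esw_reward(predicted, target):
    """Partial credit for near-miss predictions instead of a 0/1 reward."""
    if predicted == target:
        return 1.0
    return SIM.get((target, predicted), 0.0)

def reward_with_format_bonus(response, target):
    """Combine the emotion reward with a bonus for explicit structured
    reasoning (assumed format: '<think>...</think><answer>label</answer>')."""
    m = re.search(r"<answer>(.*?)</answer>", response)
    if not m:
        return 0.0                      # malformed output gets no reward
    return 0.2 + 0.8 * esw_reward(m.group(1).strip(), target)

print(reward_with_format_bonus("<think>tone is bright</think><answer>surprise</answer>", "happy"))
```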
The experiments conducted are comprehensive, utilizing multiple datasets (MELD, IEMOCAP, RAVDESS, SAVEE) to validate the effectiveness of the proposed approach. The results demonstrate significant improvements over baseline models, achieving state-of-the-art performance in SER tasks. The evaluation metrics used (Unweighted Accuracy, Weighted Accuracy, Macro F1 Score) are appropriate for the task, providing a well-rounded assessment of model performance. However, the paper could benefit from more detailed comparisons with additional state-of-the-art methods beyond those mentioned.
The implementation details are sufficiently detailed, including the model architecture, training parameters, and experimental setup. However, the absence of a publicly accessible code repository limits the reproducibility of the results. Providing access to the trained models or code would enhance the paper's impact and allow other researchers to validate the findings.
The paper acknowledges limitations, including the focus solely on the speech modality without exploring multi-modal contexts that could enhance the framework's applicability. Additionally, the computational complexity and inference efficiency issues may hinder real-time applications. These limitations suggest areas for future research and development.
The EMO-RL framework has significant implications for various applications in affective computing, such as mental health assessment, customer service, and human-computer interaction. By improving emotion recognition capabilities in audio-language models, this research paves the way for more emotionally aware AI systems, enhancing user experience and interaction quality.
Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advances in Large Audio-Language Models (LALMs) have already brought new insights to TS-ASR. However, significant room for optimization remains for the TS-ASR task within the LALM architecture. While Chain of Thought (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, TS-ASR, which requires the model to deeply comprehend speech signals, differentiate various speakers, and handle overlapping utterances, is particularly well-suited to a reasoning-guided approach. Therefore, we propose a novel framework that incorporates CoT and RL training into TS-ASR for performance improvement. A novel CoT dataset for TS-ASR is constructed, and the TS-ASR model is first trained on regular data and then fine-tuned on CoT data. Finally, the model is further trained with RL on selected data to enhance generalized reasoning capabilities. Experimental results demonstrate a significant improvement in TS-ASR performance with CoT and RL training, establishing state-of-the-art performance compared with previous TS-ASR work on comparable datasets.
The main contribution of this paper is the novel integration of Chain-of-Thought and Reinforcement Learning into the TS-ASR task, resulting in a significant performance improvement in transcribing target speakers from overlapping speech. This work represents a meaningful advancement in the field of speech recognition, particularly in challenging acoustic environments.
The proposed methodology is innovative, integrating Chain-of-Thought (CoT) and Reinforcement Learning (RL) into the Target Speaker Automatic Speech Recognition (TS-ASR) task. The construction of a novel CoT dataset tailored for TS-ASR is a significant contribution, as it allows for structured reasoning and enhances the model's ability to handle overlapping speech. The three-stage training paradigm—base model training, CoT fine-tuning, and RL refinement—demonstrates a comprehensive approach to improving model performance. However, the methodology could benefit from clearer descriptions of the CoT dataset construction process and the rationale behind certain design choices.
The experiments are well-structured, comparing the proposed model against both traditional and state-of-the-art LLM-based TS-ASR methods. The reported results show a significant reduction in word error rates (WER), indicating the effectiveness of the proposed framework. The use of ablation studies to assess the impact of different training strategies adds rigor to the evaluation. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance tests, which would strengthen the claims of improvement.
The paper provides a reasonable level of detail regarding the experimental setup, including the datasets used and the training parameters. The availability of the CoT dataset on GitHub enhances reproducibility. However, further details on the model architecture and specific hyperparameters used during training would improve the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets (LibriSpeech, Libri2Mix, Libri3Mix), which may restrict the generalizability of the findings to other domains or datasets. Additionally, while the proposed methods show improvements, the paper does not address potential overfitting issues or the model's performance in highly variable real-world scenarios. The complexity of the model may also pose challenges in terms of computational resources required for training and deployment.
The integration of reasoning capabilities into TS-ASR has significant implications for applications in real-time communication systems, assistive technologies, and multimedia content analysis. By improving the ability to transcribe overlapping speech in complex environments, this research could enhance accessibility for individuals with hearing impairments and improve the accuracy of automated transcription services in various industries.
In this work, we investigate multimodal foundation models (MFMs) for EmoFake detection (EFD) and hypothesize that they will outperform audio foundation models (AFMs). MFMs, thanks to their cross-modal pre-training, learn emotional patterns from multiple modalities, while AFMs rely only on audio. As such, MFMs can better recognize unnatural emotional shifts and inconsistencies in manipulated audio, making them more effective at distinguishing real from fake emotional expressions. To validate our hypothesis, we conduct a comprehensive comparative analysis of state-of-the-art (SOTA) MFMs (e.g. LanguageBind) alongside AFMs (e.g. WavLM). Our experiments confirm that MFMs surpass AFMs for EFD. Beyond the performance of individual foundation models (FMs), we explore FM fusion, motivated by findings in related research areas such as synthetic speech detection and speech emotion recognition. To this end, we propose SCAR, a novel framework for effective fusion. SCAR introduces a nested cross-attention mechanism, where representations from FMs interact at two sequential stages to refine information exchange. Additionally, a self-attention refinement module further enhances feature representations by reinforcing important cross-FM cues while suppressing noise. Through SCAR with synergistic fusion of MFMs, we achieve SOTA performance, surpassing standalone FMs, conventional fusion approaches, and previous works on EFD.
The main contribution of this paper is the introduction of a novel framework, SCAR, for EmoFake detection that leverages multimodal foundation models and demonstrates superior performance compared to existing audio foundation models. This research significantly advances the understanding and capabilities in the detection of emotionally manipulated audio, addressing a critical gap in the field of audio deepfake detection.
The paper presents a well-structured methodology for EmoFake detection using multimodal foundation models (MFMs) and a novel framework called SCAR for fusing these models. The nested cross-attention mechanism is a significant innovation, allowing for enhanced interaction between different modalities, which is a critical aspect of the proposed approach. The authors provide a clear explanation of the architecture and the rationale behind their design choices, which strengthens the overall methodology. However, the paper could benefit from a more detailed comparison of the proposed SCAR framework with existing fusion techniques beyond simple concatenation.
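A compact PyTorch sketch of what two sequential ("nested") cross-attention stages followed by self-attention refinement could look like when fusing embeddings from two foundation models; the dimensionality, head count, pooling, and residual placement are assumptions for illustration rather than SCAR's exact design.

```python
import torch
import torch.nn as nn

class NestedCrossAttentionFusion(nn.Module):
    """Minimal sketch of two-stage cross-attention fusion with a
    self-attention refinement head."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.stage1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.stage2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.refine = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mfm_feats, afm_feats):
        # Stage 1: multimodal-FM features attend to audio-FM features.
        x, _ = self.stage1(mfm_feats, afm_feats, afm_feats)
        # Stage 2: the refined query attends to the audio features again.
        x, _ = self.stage2(x, afm_feats, afm_feats)
        # Self-attention refinement reinforces useful cross-FM cues.
        r, _ = self.refine(x, x, x)
        return self.norm(x + r).mean(dim=1)     # pooled embedding for a classifier

fusion = NestedCrossAttentionFusion()
out = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 256])
```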
The experiments are comprehensive, utilizing a unique dataset specifically designed for EmoFake detection. The authors validate their hypothesis through rigorous testing, demonstrating that MFMs outperform AFMs in EFD tasks. The use of Equal Error Rate (EER) as a metric is appropriate for the domain, and the results are clearly presented, showing significant improvements over baseline models. However, the paper lacks a thorough exploration of the statistical significance of the results, which would bolster the claims of superiority.
The authors provide a GitHub repository with accessible code and models, which is a positive aspect for reproducibility. The training details, including optimizer settings and architecture specifics, are adequately described, allowing other researchers to replicate the experiments. However, the paper could enhance reproducibility by including more extensive documentation on the dataset and preprocessing steps.
One limitation of the study is the reliance on a single dataset for evaluation, which may not capture the full variability of EmoFake detection scenarios. Additionally, while the proposed SCAR framework shows promise, its complexity may pose challenges for real-time applications. The paper also does not address potential biases in the dataset or the models used.
The implications of this research are significant, particularly in areas such as misinformation, security, and emotional manipulation detection. As deepfake technology becomes increasingly sophisticated, the ability to detect emotionally manipulated audio could play a crucial role in maintaining trust in digital communications. The findings could inform future research directions and applications in various fields, including forensics and media verification.
Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their dependence on multi-step generation is computationally expensive and vulnerable to discretization errors. Recent advances in one-step generative modeling, particularly MeanFlow, provide a promising alternative by reformulating dynamics through average velocity fields. In this work, we present COSE, a one-step FM framework tailored for SE. To address the high training overhead of Jacobian-vector product (JVP) computations in MeanFlow, we introduce a velocity composition identity to compute average velocity efficiently, eliminating expensive computation while preserving theoretical consistency and achieving competitive enhancement quality. Extensive experiments on standard benchmarks show that COSE delivers up to 5x faster sampling and reduces training cost by 40%, all without compromising speech quality. Code is available at https://github.com/ICDM-UESTC/COSE.
The main contribution of this paper is the introduction of COSE, a one-step flow matching framework for speech enhancement that significantly reduces computational costs while maintaining high-quality output. This work represents a meaningful advancement in the efficiency of generative models for audio processing, with promising implications for real-time applications in speech technology.
The proposed COSE framework introduces a novel approach to one-step flow matching for speech enhancement by utilizing a velocity composition identity to efficiently compute average velocities. This innovation addresses the computational overhead associated with Jacobian-vector product computations in existing MeanFlow models, which is a significant improvement in terms of efficiency while maintaining theoretical consistency. The methodology is well-structured, clearly delineating the steps taken to achieve the proposed enhancements. However, the paper could benefit from a more detailed explanation of the underlying mathematical principles and their implications for the broader context of generative modeling.
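The composition trick can be illustrated on a toy one-dimensional velocity field: because the average velocity is a time integral, averages over adjacent intervals combine by interval-length weighting, which is presumably what lets COSE form long-interval average velocities from short-interval ones without Jacobian-vector products. The sketch below only checks this algebra for a state-independent field; the actual method operates on state-dependent network velocities.

```python
import numpy as np

# Toy instantaneous velocity v(tau) = sin(3*tau) + 0.5*tau with a closed-form integral.
def avg_velocity(r, t):
    integral = (np.cos(3 * r) - np.cos(3 * t)) / 3 + 0.25 * (t ** 2 - r ** 2)
    return integral / (t - r)

r, s, t = 0.1, 0.4, 0.9
lhs = (t - r) * avg_velocity(r, t)                                  # long interval
rhs = (s - r) * avg_velocity(r, s) + (t - s) * avg_velocity(s, t)   # two short intervals
print(np.isclose(lhs, rhs))  # True: interval-weighted averages compose additively
```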
The authors conducted extensive experiments on standard benchmarks, demonstrating that COSE achieves up to 5x faster sampling and a 40% reduction in training costs without sacrificing speech quality. This is a compelling result that indicates the practical applicability of the framework. However, the paper lacks detailed descriptions of the datasets used, the specific metrics for evaluating speech quality, and comparisons with other state-of-the-art methods, which would strengthen the validation of the results.
The availability of code on GitHub is a positive aspect that enhances reproducibility. However, the paper does not provide sufficient details on the experimental setup, hyperparameter configurations, or specific versions of libraries used, which could hinder other researchers from replicating the results accurately.
One limitation is the reliance on standard benchmarks without exploring real-world applications or datasets that may present different challenges. Additionally, while the reduction in computational cost is significant, the paper does not discuss potential trade-offs in terms of model complexity or performance in edge cases.
The COSE framework has the potential to significantly impact the field of speech enhancement by providing a more efficient method that can be integrated into real-time applications, such as voice assistants and hearing aids. Its implications extend to various domains where clear speech quality is crucial, potentially improving user experiences across multiple technologies.
Multichannel speech enhancement leverages spatial cues to improve intelligibility and quality, but most learning-based methods rely on specific microphone array geometry, unable to account for geometry changes. To mitigate this limitation, current array-agnostic approaches employ large multi-geometry datasets but may still fail to generalize to unseen layouts. We propose AmbiDrop (Ambisonics with Dropouts), an Ambisonics-based framework that encodes arbitrary array recordings into the spherical harmonics domain using Ambisonics Signal Matching (ASM). A deep neural network is trained on simulated Ambisonics data, combined with channel dropout for robustness against array-dependent encoding errors, therefore omitting the need for a diverse microphone array database. Experiments show that while the baseline and proposed models perform similarly on the training arrays, the baseline degrades on unseen arrays. In contrast, AmbiDrop consistently improves SI-SDR, PESQ, and STOI, demonstrating strong generalization and practical potential for array-agnostic speech enhancement.
Primary: Ben Gurion University of the Negev
All Institutions: School of Electrical and Computer Engineering, Ben Gurion University of the Negev
The paper presents AmbiDrop, a novel Ambisonics-based framework for array-agnostic speech enhancement, demonstrating strong generalization capabilities and practical potential for diverse applications. The technical contribution is significant, addressing a critical challenge in the field of multichannel speech enhancement.
The proposed AmbiDrop framework introduces a novel approach to array-agnostic speech enhancement by utilizing Ambisonics encoding and dropout-based learning. The methodology effectively addresses the limitations of existing multichannel speech enhancement techniques that rely on specific microphone geometries. By encoding arbitrary array recordings into the spherical harmonics domain, the authors create a robust input representation that is independent of array configuration. The incorporation of dropout during training simulates the challenges of real-world encoding errors, enhancing the model's robustness. This innovative approach is well-justified and theoretically sound, providing a clear pathway for practical application.
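A small NumPy sketch of the channel-dropout idea on a first-order Ambisonics signal; the dropout probability, the choice of protected channels, and the per-channel (rather than per-coefficient) masking are assumptions for illustration, not AmbiDrop's training recipe.

```python
import numpy as np

def ambisonics_channel_dropout(ambi, p=0.3, protect_order=0, rng=None):
    """Randomly zero spherical-harmonic channels of an Ambisonics signal.

    ambi: array of shape (channels, samples); order N has (N+1)**2 channels.
    Channels up to `protect_order` are kept so the signal never vanishes."""
    rng = rng or np.random.default_rng()
    n_protected = (protect_order + 1) ** 2
    mask = np.ones(ambi.shape[0])
    drop = rng.random(ambi.shape[0] - n_protected) < p
    mask[n_protected:] = ~drop
    return ambi * mask[:, None]

# 1st-order Ambisonics (4 channels), 0.5 s at 16 kHz
x = np.random.default_rng(0).normal(size=(4, 8000))
print(ambisonics_channel_dropout(x, p=0.5).std(axis=1))  # dropped channels show 0
```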
The experiments are comprehensive, comparing the proposed model against a baseline that relies on specific microphone configurations. The results demonstrate that while both models perform similarly on training arrays, AmbiDrop significantly outperforms the baseline on unseen arrays, showcasing its generalization capabilities. The use of objective metrics such as SI-SDR, PESQ, and STOI provides a solid foundation for evaluating performance. However, the paper could benefit from additional qualitative assessments or user studies to further validate the perceptual quality improvements.
The paper includes sufficient detail regarding the experimental setup, including the generation of datasets and the training process. However, the absence of a publicly available code repository or demo URL limits reproducibility. Future work should consider releasing the code and datasets to facilitate further research and validation of the proposed methods.
One limitation of the study is the reliance on simulated data for training, which may not fully capture the complexities of real-world scenarios. Additionally, while the model shows strong performance on unseen arrays, the results on the AR glasses array indicate potential challenges in generalization to highly irregular configurations. Future work should explore these aspects further.
The AmbiDrop framework has significant implications for various applications, including telecommunication, hearing aids, and human-computer interaction. By providing a robust solution for speech enhancement across diverse microphone geometries, it can improve user experiences in real-world environments where array configurations vary widely. The potential for deployment in consumer devices could enhance accessibility and usability in everyday communication scenarios.
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss using its derivatives, Instantaneous Frequency and Group Delay, for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and spatial characteristics.
Primary: ar-LAB
All Institutions: ar-LAB
The paper presents a novel VAE architecture, εar-VAE, that significantly enhances high-fidelity music reconstruction by integrating perceptual weighting and innovative loss functions. This comprehensive analysis highlights the technical contributions and potential impact on the audio processing field, showcasing a meaningful advancement in machine learning applications for audio.
The proposed methodology introduces several innovative components to the VAE architecture tailored for audio signal reconstruction. The integration of a K-weighting perceptual filter is a significant enhancement, aligning the model's objectives with psychoacoustic principles. The introduction of novel phase losses (Correlation Loss and Phase Loss) addresses critical aspects of audio fidelity, particularly in stereo coherence and transient clarity. The spectral supervision paradigm, which separates magnitude and phase supervision, is a thoughtful approach that reflects an understanding of the complexities involved in audio reconstruction. Overall, the methodology is well-structured and presents a comprehensive approach to improving high-fidelity audio generation.
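To make the phase-derivative losses concrete, here is a rough NumPy/SciPy sketch that penalizes differences in instantaneous frequency (phase difference along time) and group delay (phase difference along frequency) between a reference and a reconstruction. The STFT settings, wrapping scheme, and equal weighting are assumptions, and the paper's K-weighting filter and M/S/L/R supervision are omitted.

```python
import numpy as np
from scipy.signal import stft

def phase_derivative_loss(ref, est, fs=44100, nperseg=1024):
    """Toy phase loss built from instantaneous-frequency and group-delay errors."""
    _, _, R = stft(ref, fs=fs, nperseg=nperseg)
    _, _, E = stft(est, fs=fs, nperseg=nperseg)
    # Wrapped phase differences along a given axis.
    dphi = lambda a, axis: np.angle(np.exp(1j * np.diff(np.angle(a), axis=axis)))
    if_err = np.mean(np.abs(dphi(R, 1) - dphi(E, 1)))   # instantaneous frequency (time axis)
    gd_err = np.mean(np.abs(dphi(R, 0) - dphi(E, 0)))   # group delay (frequency axis)
    return if_err + gd_err

t = np.arange(0, 1.0, 1 / 44100)
clean = np.sin(2 * np.pi * 440 * t)
print(phase_derivative_loss(clean, np.roll(clean, 5)))
```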
The experimental setup is robust, utilizing a combination of public datasets and a proprietary in-house dataset, which strengthens the validity of the results. The paper provides detailed metrics for evaluating performance, including novel metrics for phase accuracy, which adds depth to the evaluation process. The comparison against leading models such as EnCodec and DAC demonstrates the effectiveness of the proposed approach. However, the results could benefit from additional qualitative assessments, such as listening tests, to complement the quantitative metrics.
The paper includes a clear description of the training process, model architecture, and loss functions, which aids in reproducibility. The availability of model weights and code on the provided demo URL is a positive aspect that encourages further exploration and validation by the research community. However, the paper could enhance reproducibility by providing more detailed hyperparameter settings and training configurations.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of audio signals encountered in real-world applications. Additionally, while the model shows improvements in reconstruction quality, the paper does not address potential computational costs or efficiency concerns associated with the proposed architecture. The focus on perceptual aspects may also overlook other factors influencing audio quality.
The advancements presented in this paper have significant implications for audio engineering, music production, and machine learning applications in audio synthesis. By improving the fidelity of audio reconstruction, this work could enhance various applications, including music streaming, audio restoration, and virtual reality audio experiences. The open-source nature of the model promotes accessibility and encourages further research in the field.
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing
The main contribution of this paper is the introduction of DAIEN-TTS, an innovative environment-aware zero-shot TTS framework that enables disentangled control of speaker timbre and background environments, significantly advancing the capabilities of text-to-speech synthesis. The methodology and experimental results demonstrate a meaningful step forward in the field, with potential applications that could reshape user interactions with synthesized speech.
The proposed DAIEN-TTS framework introduces a novel approach to zero-shot TTS by utilizing disentangled audio infilling, which allows for independent control over speaker timbre and environmental background. The incorporation of a pretrained speech-environment separation (SES) module is a significant methodological advancement, as it effectively disentangles the speech and environmental components. The use of random span masking during training and dual classifier-free guidance (DCFG) during inference enhances the model's controllability and adaptability to varying conditions. The methodology is well-structured, leveraging existing frameworks like F5-TTS while innovating on top of them, showcasing a clear progression in the field of TTS synthesis.
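One common way to realize two independent guidance knobs is to linearly combine predictions obtained under different conditioning dropouts, as in the toy sketch below; the weights and the exact combination used by DAIEN-TTS may differ.

```python
def dual_cfg(v_uncond, v_speech, v_env, w_speech=2.0, w_env=1.0):
    """Combine model outputs under different conditioning dropouts with two
    independent guidance weights (one possible form of dual guidance)."""
    return (v_uncond
            + w_speech * (v_speech - v_uncond)
            + w_env * (v_env - v_uncond))

# toy scalars standing in for predicted vector fields / mel frames
print(dual_cfg(0.0, 1.0, -0.5, w_speech=2.0, w_env=1.0))  # -> 1.5
```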
The experiments are comprehensive, utilizing the LibriTTS corpus and the DNS-Challenge dataset to simulate a variety of environmental conditions. The evaluation metrics include both objective measures (WER, SIM-o) and subjective assessments (MOS for naturalness, speaker similarity, and environment similarity), providing a robust framework for assessing the model's performance. The results demonstrate that DAIEN-TTS outperforms existing baselines, including F5-TTS, in both silence and background environment scenarios, indicating its effectiveness in generating high-quality, environment-aware speech. The thoroughness of the experimental setup and the clarity of the results contribute positively to the paper's impact.
The paper provides detailed descriptions of the model architecture, training procedures, and evaluation metrics, which are essential for reproducibility. However, specific hyperparameters and the exact configurations of the training environment (e.g., GPU specifications) are mentioned but could be elaborated further to enhance reproducibility. The authors could also consider providing a code repository to facilitate implementation by other researchers.
One limitation of the study is the reliance on the LibriTTS corpus, which may not fully capture the diversity of real-world speech and environmental conditions. Additionally, while the model shows strong performance in controlled settings, its robustness in highly variable real-world scenarios remains to be tested. The paper does not address potential biases in the training data, which could affect the generalizability of the model.
The DAIEN-TTS framework has significant implications for applications in virtual reality, audiobooks, and personalized voice assistants, where the ability to synthesize speech with varying environmental contexts can enhance user experience. The ability to independently control speaker characteristics and background environments could lead to more immersive and realistic interactions in various multimedia applications. The research also contributes to the broader field of speech synthesis by addressing the challenge of environment-aware synthesis, paving the way for future advancements in TTS technologies.
Contrastive language–audio pretraining (CLAP) has achieved remarkable success as an audio–text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatial information lies in multi-source conditions, where the correct correspondence between each sound source and its location is required. To tackle this problem, we propose Spatial-CLAP, which introduces a content-aware spatial encoder that enables spatial representations coupled with audio content. We further propose spatial contrastive learning (SCL), a training strategy that explicitly enforces the learning of the correct correspondence and promotes more reliable embeddings under multi-source conditions. Experimental evaluations, including downstream tasks, demonstrate that Spatial-CLAP learns effective embeddings even under multi-source conditions, and confirm the effectiveness of SCL. Moreover, evaluation on unseen three-source mixtures highlights the fundamental distinction between conventional single-source training and our proposed multi-source training paradigm. These findings establish a new paradigm for spatially-aware audio–text embeddings.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Keio University
The main contribution of this work is the introduction of Spatial-CLAP, a spatially-aware audio-text embedding model that effectively captures both content and spatial information in multi-source conditions. This research significantly advances the field of audio processing by addressing the limitations of existing models and providing a strong foundation for future developments in spatial audio understanding.
The paper introduces Spatial-CLAP, a novel audio-text embedding model that effectively integrates spatial information into the existing CLAP framework. The methodology is robust, featuring a content-aware spatial encoder (CA-SE) that captures spatial representations alongside audio content, and a spatial contrastive learning (SCL) strategy that enhances the model's ability to learn correct content-space correspondences in multi-source conditions. This dual approach is innovative and addresses a significant gap in existing models, which have primarily focused on single-source scenarios. The use of simulated room impulse responses (RIRs) for training and the incorporation of hard negative examples in SCL are particularly noteworthy, as they provide a rigorous framework for improving the model's performance in complex auditory environments.
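To make the role of the hard negatives concrete, below is a minimal sketch of a CLAP-style symmetric contrastive loss with an extra hard-negative column (e.g., captions whose source-location correspondence has been swapped), assuming L2-normalised embeddings; the tensor names and the temperature value are illustrative placeholders rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def spatial_contrastive_loss(audio_emb, text_emb, hard_neg_text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings, with one
    explicit hard negative per item appended on the text side.

    audio_emb:         (B, D) L2-normalised audio embeddings
    text_emb:          (B, D) L2-normalised matching caption embeddings
    hard_neg_text_emb: (B, D) L2-normalised mismatched (hard negative) captions
    """
    logits = audio_emb @ text_emb.t() / temperature                       # (B, B) in-batch pairs
    hard_logits = (audio_emb * hard_neg_text_emb).sum(-1, keepdim=True) / temperature  # (B, 1)
    logits_a2t = torch.cat([logits, hard_logits], dim=1)                  # append hard-negative column
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits_a2t, targets)                       # audio -> text
    loss_t2a = F.cross_entropy(logits.t(), targets)                       # text -> audio (in-batch only)
    return 0.5 * (loss_a2t + loss_t2a)
```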
The experimental evaluations are comprehensive, utilizing a variety of metrics to assess the performance of Spatial-CLAP across different conditions, including single-source and multi-source scenarios. The results demonstrate significant improvements over conventional methods, particularly in tasks that require understanding spatial relationships in audio. The paper includes detailed comparisons with baseline models, showcasing the effectiveness of the proposed methods. However, the reliance on synthetic data and simulated environments could limit the generalizability of the findings to real-world applications.
The authors have committed to releasing their code and pretrained models, which is crucial for reproducibility. The detailed descriptions of the model architecture, training procedures, and datasets used further enhance the reproducibility of the study. However, the paper could benefit from more explicit details regarding the hyperparameter tuning and the specific configurations used during training.
One limitation of the study is the potential overfitting to the synthetic training conditions, as the model's performance in real-world scenarios remains untested. Additionally, while the model shows promise in handling multi-source conditions, its performance in dynamic environments with moving sources has not been addressed. The paper also does not explore the computational efficiency of the model, which is an important consideration for practical applications.
The development of Spatial-CLAP has significant implications for various applications, including augmented reality (AR), virtual reality (VR), and robotics, where understanding spatial audio cues is critical. By advancing the state of the art in audio-text embeddings, this research could enhance the capabilities of systems that rely on accurate audio perception and interpretation, leading to more immersive and responsive user experiences. The main contribution of this work is the introduction of Spatial-CLAP, a spatially-aware audio-text embedding model that effectively captures both content and spatial information in multi-source conditions. This research significantly advances the field of audio processing by addressing the limitations of existing models and providing a strong foundation for future developments in spatial audio understanding.
While existing speech audio codecs designed for compression exploit limited forms of temporal redundancy and allow for multi-scale representations, they tend to represent all features of audio in the same way. In contrast, generative voice models designed for text-to-speech and voice transfer tasks have recently proved effective at factorizing audio signals into high-level semantic representations of fundamentally distinct features. In this paper, we leverage such representations in a novel semantic communications approach to achieve lower bitrates without sacrificing perceptual quality or suitability for specific downstream tasks. Our technique matches or outperforms existing audio codecs on transcription, sentiment analysis, and speaker verification when encoding at 2-4x lower bitrate -- notably surpassing Encodec in perceptual quality and speaker verification while using up to 4x less bitrate.
Primary: & Technology Research
All Institutions: & Technology Research
The main contribution of this paper is the introduction of a novel approach to semantic audio compression that significantly reduces bitrate while preserving perceptual quality and task-relevant information. This research represents a meaningful advancement in the field of audio processing and machine learning, offering a new direction for future exploration in efficient communication technologies.
The paper presents a novel semantic compression approach that leverages generative voice models to factor audio signals into high-level semantic representations. This method is innovative as it focuses on preserving semantic information relevant to specific downstream tasks rather than encoding all audio features uniformly. The approach is well-structured, utilizing a combination of content-style tokens and timbre samples to achieve lower bitrates while maintaining quality. However, the methodology could benefit from clearer explanations of the encoding schemes and the rationale behind the choices made, particularly regarding the use of Vevo and the auxiliary compression techniques.
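As a rough illustration of why a discrete content-style token stream can reach such low bitrates, the following back-of-envelope calculation relates token rate and codebook size to bits per second; the numbers used are hypothetical and are not the paper's actual configuration.

```python
import math

def token_stream_bitrate(tokens_per_second: float, codebook_size: int) -> float:
    """Bits per second needed to transmit a discrete token stream,
    assuming fixed-length coding of each token index."""
    return tokens_per_second * math.log2(codebook_size)

# Hypothetical example: 50 content-style tokens/s drawn from a 1024-entry
# codebook gives 500 bits/s for the token stream; a one-off timbre sample
# would be amortised over the whole session on top of this.
print(token_stream_bitrate(50, 1024))  # 500.0
```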
The experiments are comprehensive, utilizing the VoxCeleb1 dataset and evaluating the proposed method against traditional and neural codecs across multiple downstream tasks, including transcription, sentiment analysis, and speaker verification. The results demonstrate that the proposed method consistently outperforms existing codecs at lower bitrates, which is a significant achievement. However, the paper lacks detailed statistical analyses of the results, such as confidence intervals or significance testing, which would strengthen the claims made about performance improvements.
The paper does not provide sufficient details on the implementation of the proposed methods, such as specific hyperparameters, training procedures, or the exact architecture of the models used. This lack of detail may hinder reproducibility for other researchers attempting to replicate the results. Including code or a detailed supplementary material section would greatly enhance reproducibility.
The paper acknowledges several limitations, including the inability to handle overlapping speakers and the potential for latency due to the timbre encoding approach. Additionally, the reliance on a single dataset (VoxCeleb1) may limit the generalizability of the findings. The authors also note that errors in timbre transmission can lead to permanent inaccuracies in voice reconstruction, which is a critical concern for real-time applications.
The proposed method has significant implications for ultra-low bandwidth voice communication, particularly in applications where bandwidth is constrained, such as remote areas or during emergencies. The ability to maintain high-quality audio while reducing bitrate could enhance communication technologies in various fields, including telecommunication, assistive technologies, and real-time translation services. The focus on semantic preservation aligns with ongoing trends in AI and machine learning, making this research relevant to future advancements in the field. The main contribution of this paper is the introduction of a novel approach to semantic audio compression that significantly reduces bitrate while preserving perceptual quality and task-relevant information. This research represents a meaningful advancement in the field of audio processing and machine learning, offering a new direction for future exploration in efficient communication technologies.
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.
This paper presents a phonetic perspective on adversarial attacks in audio processing, revealing how subtle perturbations can mislead both speech recognition and speaker verification systems. The innovative approach and thorough experimental evaluation contribute valuable insights to the field of machine learning and speech technology, emphasizing the need for phonetic-aware defenses.
The methodology is well-structured, employing a white-box attack approach on the DeepSpeech model to generate adversarial examples. The paper effectively formulates the problem of adversarial attacks at both the transcription and speaker identity levels, providing a clear mathematical framework for the attack success criteria. The phonetic analysis of perturbations is a novel angle that enriches the understanding of how adversarial attacks can exploit linguistic features. However, the paper could benefit from a more detailed explanation of the optimization process and the specific metrics used to quantify phonetic confusions.
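For context, targeted audio attacks of this kind are typically optimised in the Carlini-Wagner style, minimising the CTC loss of the attacker's target transcript under an L-infinity budget. The sketch below illustrates that generic recipe only; `asr_model`, `ctc_loss_fn`, and all hyperparameters are placeholders, not the paper's exact optimisation.

```python
import torch

def targeted_attack(waveform, target_ids, asr_model, ctc_loss_fn,
                    eps=0.01, lr=1e-3, steps=500):
    """Optimise an additive perturbation delta so that the ASR model decodes
    waveform + delta as the attacker-chosen transcript, while keeping
    ||delta||_inf <= eps (the imperceptibility budget)."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = asr_model(waveform + delta)      # (T, vocab) frame logits (placeholder interface)
        loss = ctc_loss_fn(logits, target_ids)    # push decoding toward the target phrase
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)               # project back onto the L-infinity ball
    return (waveform + delta).detach()
```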
The experiments are comprehensive, utilizing a diverse dataset (VCTK corpus) and a variety of target phrases that cover a wide range of phonetic structures. The results clearly demonstrate the dual impact of adversarial perturbations on both transcription accuracy and speaker identity drift. The use of two state-of-the-art embedding models for speaker verification adds robustness to the findings. However, the paper could improve by including more detailed statistical analysis of the results and addressing potential variability in the experimental setup.
The paper provides a GitHub repository for additional figures and visualizations, which is a positive aspect for reproducibility. However, it lacks detailed implementation instructions or code snippets within the text that would facilitate easier replication of the experiments by other researchers.
The study is limited to white-box attacks in controlled environments, which may not fully represent real-world scenarios where adversarial attacks can be more complex due to environmental factors and black-box models. Additionally, the paper does not address the potential implications of over-the-air effects or the performance of defenses against such attacks.
The findings of this work have significant implications for the security of ASR and speaker verification systems, highlighting vulnerabilities that could be exploited in real-world applications. The focus on phonetic features in adversarial attacks opens new avenues for research in developing more robust speech technologies and defenses against adversarial manipulation. The insights gained from this study could inform the design of future systems that are more resilient to such threats. This paper presents a phonetic perspective on adversarial attacks in audio processing, revealing how subtle perturbations can mislead both speech recognition and speaker verification systems. The innovative approach and thorough experimental evaluation contribute valuable insights to the field of machine learning and speech technology, emphasizing the need for phonetic-aware defenses.
Multistep inference is a bottleneck for real-time generative speech enhancement because flow- and diffusion-based systems learn an instantaneous velocity field and therefore rely on iterative ordinary differential equation (ODE) solvers. We introduce MeanFlowSE, a conditional generative model that learns the average velocity over finite intervals along a trajectory. Using a Jacobian-vector product (JVP) to instantiate the MeanFlow identity, we derive a local training objective that directly supervises finite-interval displacement while remaining consistent with the instantaneous-field constraint on the diagonal. At inference, MeanFlowSE performs single-step generation via a backward-in-time displacement, removing the need for multistep solvers; an optional few-step variant offers additional refinement. On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines. The method requires no knowledge distillation or external teachers, providing an efficient, high-fidelity framework for real-time generative speech enhancement. The proposed method is open-sourced at https://github.com/liduojia1/MeanFlowSE.
The main contribution of this paper is the introduction of MeanFlowSE, a novel framework for real-time generative speech enhancement that achieves high-quality results with significantly reduced computational costs. This work represents a meaningful advancement in the field of audio processing, particularly in the context of generative models for speech enhancement.
The proposed MeanFlowSE model innovatively addresses the bottleneck of multistep inference in generative speech enhancement by introducing a framework that learns an average velocity field for finite-interval displacement. This approach, leveraging the MeanFlow identity and Jacobian-vector product, allows for single-step inference, which is a significant advancement over traditional methods that rely on iterative ODE solvers. The methodology is well-structured, with a clear training objective that aligns with the instantaneous-field constraint, and the use of a backward-in-time displacement during inference is particularly noteworthy for its efficiency.
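To make the training objective concrete, the following sketch instantiates a MeanFlow-style loss with a Jacobian-vector product via `torch.func.jvp`, supervising the average-velocity network against the target implied by the MeanFlow identity u = v - (t - r) d/dt u; the model interface, conditioning, and time-sampling scheme are assumptions for illustration, not the paper's exact implementation.

```python
import torch
from torch.func import jvp

def meanflow_loss(model, z_t, v_t, r, t):
    """MeanFlow-style local objective (sketch). The average-velocity network
    u(z, r, t) is regressed onto v - (t - r) * d/dt u, where d/dt is the total
    derivative along the trajectory, computed with a JVP.

    z_t: state at time t, shape (B, ...); v_t: instantaneous velocity target
    at (z_t, t); r, t: interval endpoints, shape (B,).
    """
    def u_fn(z, r_, t_):
        return model(z, r_, t_)

    # Tangent (v_t, 0, 1) realises dz/dt = v, dr/dt = 0, dt/dt = 1.
    u, du_dt = jvp(u_fn, (z_t, r, t),
                   (v_t, torch.zeros_like(r), torch.ones_like(t)))
    gap = (t - r).view(-1, *[1] * (z_t.dim() - 1))
    u_target = v_t - gap * du_dt
    return torch.mean((u - u_target.detach()) ** 2)   # stop-gradient on the target
```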
The experiments conducted on the VoiceBank-DEMAND dataset are robust, showcasing the performance of MeanFlowSE against several state-of-the-art baselines. The reported metrics, including intelligibility, fidelity, and perceptual quality, demonstrate that MeanFlowSE not only matches but often surpasses existing methods while achieving a significantly lower real-time factor. The comprehensive comparison with other models, such as SGMSE and FlowSE, provides a strong validation of the proposed method's effectiveness.
The paper provides sufficient details regarding the implementation, including the architecture (NCSN++ U-Net with self-attention), training procedures, and evaluation metrics. The open-sourcing of the code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings presented.
One noted limitation is the reliance on a linear-Gaussian path for modeling, which may restrict the flexibility of the approach in more complex scenarios. Additionally, the use of first-order derivative estimation could introduce inaccuracies, particularly in non-linear contexts. Future work is suggested to explore more sophisticated modeling techniques that could mitigate these issues.
The implications of this research extend to various applications in real-time communication systems, automatic speech recognition, and assistive technologies for the hearing impaired. By improving the efficiency and quality of speech enhancement, this work has the potential to significantly enhance user experiences in noisy environments and contribute to advancements in human-computer interaction. The main contribution of this paper is the introduction of MeanFlowSE, a novel framework for real-time generative speech enhancement that achieves high-quality results with significantly reduced computational costs. This work represents a meaningful advancement in the field of audio processing, particularly in the context of generative models for speech enhancement.
This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines. To address the inherent difficulties of modeling continuous features, we propose a representation alignment module that aligns output representations of the transformer decoder with semantic embeddings from a pretrained ASR encoder during training. This mechanism not only speeds up training convergence, but also enhances cross-modal coherence between the textual and acoustic domains. Comprehensive experiments demonstrate that MELA-TTS achieves state-of-the-art performance across multiple evaluation metrics while maintaining robust zero-shot voice cloning capabilities, in both offline and streaming synthesis modes. Our results establish a new benchmark for continuous feature generation approaches in TTS, offering a compelling alternative to discrete-token-based paradigms.
Primary: The first two authors contribute equally to this work
All Institutions: The first two authors contribute equally to this work
The paper presents MELA-TTS, a novel joint transformer-diffusion framework for TTS synthesis that eliminates the need for speech tokenization, achieving state-of-the-art performance while enhancing training efficiency and output coherence. This work significantly advances the field of speech synthesis by addressing key limitations of existing models and providing a compelling alternative to traditional approaches.
The proposed MELA-TTS framework integrates a joint transformer-diffusion model that innovatively addresses the limitations of traditional TTS systems reliant on discrete tokenization. The introduction of a representation alignment module is a significant methodological advancement, as it aligns the outputs of the transformer with semantic embeddings from a pretrained ASR encoder, enhancing both training efficiency and output coherence. The autoregressive generation of continuous mel-spectrograms without tokenization is a notable shift in paradigm, indicating a robust approach to continuous feature modeling.
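One plausible realisation of such a representation alignment objective is a cosine loss between projected decoder states and frozen ASR-encoder embeddings, sketched below; the projection layer and the assumption that both sequences are already resampled to the same length are illustrative choices, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def representation_alignment_loss(decoder_hidden, asr_embeddings, proj):
    """Align TTS decoder hidden states with semantic embeddings from a frozen,
    pretrained ASR encoder via a learned projection and a cosine objective.

    decoder_hidden: (B, T, D_dec); asr_embeddings: (B, T, D_asr), assumed
    already time-aligned; proj: learned linear map D_dec -> D_asr.
    """
    projected = proj(decoder_hidden)                               # (B, T, D_asr)
    cos = F.cosine_similarity(projected, asr_embeddings.detach(), dim=-1)
    return (1.0 - cos).mean()                                      # 0 when perfectly aligned
```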
The experiments are comprehensive, utilizing both small (LibriTTS) and large-scale datasets (170,000 hours) to validate the model's performance across various metrics, including WER and CER. The results demonstrate state-of-the-art performance in multiple scenarios, showcasing the model's effectiveness in both offline and streaming synthesis modes. The ablation studies provide strong evidence for the contributions of the representation alignment and utterance embedding, reinforcing the robustness of the experimental design.
While the paper provides a detailed description of the model architecture and training process, the absence of a publicly available code repository or demo limits reproducibility. The methodology is well-documented, but without access to the implementation, independent validation of the results is challenging.
One limitation noted is the model's performance in voice cloning, particularly in speaker similarity metrics compared to discrete-token-based systems. The authors acknowledge that the diffusion module's reliance on local context may hinder its ability to leverage broader input conditions, suggesting potential areas for future improvement. Additionally, the model's complexity may pose challenges for deployment in real-time applications.
The implications of MELA-TTS extend beyond TTS synthesis, potentially influencing areas such as audio generation, voice cloning, and even applications in music synthesis. By eliminating the need for tokenization and multi-stage processing, the framework could lead to more efficient and natural-sounding speech synthesis systems, enhancing user experiences in various domains. The paper presents MELA-TTS, a novel joint transformer-diffusion framework for TTS synthesis that eliminates the need for speech tokenization, achieving state-of-the-art performance while enhancing training efficiency and output coherence. This work significantly advances the field of speech synthesis by addressing key limitations of existing models and providing a compelling alternative to traditional approaches.
The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values compared to well-established Mel vocoding baselines that are not streaming-capable, including HiFi-GAN.
Primary: Technology and Space (BMFTR) under grant agreement No. 01IS24072A (COMFORT)
All Institutions: We acknowledge funding by the German Federal Ministry of Research, Technology and Space (BMFTR) under grant agreement No. 01IS24072A (COMFORT)
The main contribution of this paper is the introduction of MelFlow, a streaming generative Mel vocoder that achieves real-time performance with significantly improved audio quality metrics compared to existing methods. This work represents a meaningful advancement in the field of audio processing, particularly for applications requiring low-latency speech synthesis.
The paper introduces MelFlow, a novel streaming-capable generative Mel vocoder that leverages generative flow matching and builds on previous work in diffusion-based STFT phase retrieval. The methodology is well-structured, combining established techniques with new innovations to achieve real-time performance. The authors effectively define algorithmic and total latency, providing a clear framework for their streaming approach. The use of causal convolutional neural networks and an efficient caching mechanism for inference is particularly noteworthy, as it allows for real-time processing without compromising output quality. However, the paper could benefit from a more detailed explanation of the iterative inference scheme and how it compares to traditional methods.
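For reference, the reported latencies translate directly into sample counts at 16 kHz, as the small calculation below shows; how the 32 ms algorithmic latency splits into STFT window and hop is not specified here and is left as an assumption.

```python
SAMPLE_RATE = 16_000

def ms_to_samples(ms: float, sr: int = SAMPLE_RATE) -> int:
    """Convert a latency in milliseconds to a sample count at the given rate."""
    return int(round(ms * sr / 1000))

# 32 ms of algorithmic latency corresponds to 512 samples at 16 kHz; the 48 ms
# total latency adds another 16 ms (256 samples) of compute/buffering headroom.
print(ms_to_samples(32))  # 512
print(ms_to_samples(48))  # 768
```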
The experimental section is robust, demonstrating the effectiveness of MelFlow against established baselines such as HiFi-GAN. The authors provide comprehensive metrics (PESQ, SI-SDR, etc.) to evaluate performance, showing significant improvements in quality metrics while maintaining real-time capabilities. The use of multiple datasets (EARS-WHAM v2 and LibriTTS) adds credibility to the results. However, the experiments could be strengthened by including more diverse datasets and additional qualitative assessments of audio quality.
The paper mentions plans to provide a public code repository and model checkpoints, which is a positive step towards reproducibility. However, specific implementation details, such as hyperparameter settings and training configurations, could be more explicitly stated to facilitate replication by other researchers. The lack of a demo URL also limits immediate accessibility for interested parties.
One limitation is the potential trade-off between the number of inference steps and the quality of output, as indicated by the results showing differences between N=5 and N=25. Additionally, while the paper claims substantial improvements over non-streaming methods, it does not fully explore the implications of these improvements in practical applications. The focus on a single sampling rate (16 kHz) may also limit the generalizability of the findings.
The development of a real-time streaming Mel vocoder has significant implications for text-to-speech systems and other speech processing applications. By enabling more natural and interactive communication, this research could enhance user experiences in various domains, including virtual assistants, gaming, and telecommunication. The methodology could also inspire further innovations in real-time audio processing and other generative models. The main contribution of this paper is the introduction of MelFlow, a streaming generative Mel vocoder that achieves real-time performance with significantly improved audio quality metrics compared to existing methods. This work represents a meaningful advancement in the field of audio processing, particularly for applications requiring low-latency speech synthesis.
Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness substantially degrades in low-resource languages such as Thai. This limitation arises from three factors: (1) existing commonly used speech encoders, like the Whisper family, underperform in low-resource languages and lack support for broader spoken language understanding tasks; (2) the ASR-based alignment paradigm requires training the entire SLLM, leading to high computational cost; (3) paired speech-text data in low-resource languages is scarce. To overcome these challenges in the low-resource language Thai, we introduce XLSR-Thai, the first self-supervised learning (SSL) speech encoder for Thai. It is obtained by continuously training the standard SSL XLSR model on 36,000 hours of Thai speech data. Furthermore, we propose U-Align, a speech-text alignment method that is more resource-efficient and multitask-effective than typical ASR-based alignment. Finally, we present Thai-SUP, a pipeline for generating Thai spoken language understanding data from high-resource languages, yielding the first Thai spoken language understanding dataset of over 1,000 hours. Multiple experiments demonstrate the effectiveness of our methods in building a Thai multitask-understanding SLLM. We open-source XLSR-Thai and Thai-SUP to facilitate future research.
Primary: School of Computer Science
All Institutions: School of Computer Science
The main contribution of this paper is the introduction of XLSR-Thai, U-Align, and the Thai-SUP pipeline, which collectively address the challenges of building effective speech large language models for multitask understanding in low-resource languages. This work significantly advances the field by providing innovative solutions to longstanding issues in speech processing for underrepresented languages.
The paper proposes a comprehensive methodology that includes the development of XLSR-Thai, a self-supervised learning speech encoder specifically for Thai, which is a significant advancement given the scarcity of resources for low-resource languages. The introduction of U-Align as a more efficient speech-text alignment method is innovative, as it circumvents the computational costs associated with traditional ASR-based methods. The Thai-SUP pipeline for generating spoken language understanding data from high-resource languages is a practical solution to the data scarcity problem, showcasing a well-rounded approach to the challenges faced in low-resource language processing.
The experiments conducted are extensive and demonstrate the effectiveness of the proposed methods. The results show that XLSR-Thai outperforms existing models in ASR performance and multitask understanding tasks. The comparative analysis of U-Align against ASR-based alignment methods provides clear evidence of its advantages in terms of both performance and efficiency. The use of multiple metrics (e.g., character error rate, classification accuracy) adds robustness to the evaluation.
The paper mentions that XLSR-Thai and Thai-SUP are open-sourced, which is a positive aspect for reproducibility. However, the details regarding the implementation of U-Align and the specific datasets used could be more thoroughly documented to enhance reproducibility further. The reliance on various external datasets and models also necessitates careful attention to their availability and licensing.
One limitation is the focus on a single low-resource language (Thai), which may not generalize to other languages with different linguistic structures or phonetic characteristics. Additionally, while the proposed methods are resource-efficient, the initial training on large datasets (36,000 hours) may still pose a barrier for some researchers. The paper could also benefit from a more detailed discussion on the potential biases introduced by the data generation process in Thai-SUP.
The research has significant implications for the development of speech technologies in low-resource languages, which are often overlooked in the field of machine learning. By providing tools and datasets for Thai, the work encourages further research and development in similar languages, potentially leading to more inclusive and accessible AI technologies. The findings could also influence policy decisions regarding language preservation and technology deployment in multilingual societies. The main contribution of this paper is the introduction of XLSR-Thai, U-Align, and the Thai-SUP pipeline, which collectively address the challenges of building effective speech large language models for multitask understanding in low-resource languages. This work significantly advances the field by providing innovative solutions to longstanding issues in speech processing for underrepresented languages.
Multimodal acoustic event classification plays a key role in audio-visual systems. Although combining audio and visual signals improves recognition, it is still difficult to align them over time and to reduce the effect of noise across modalities. Existing methods often treat audio and visual streams separately, fusing features later with contrastive or mutual information objectives. Recent advances explore multimodal graph learning, but most fail to distinguish between intra- and inter-modal temporal dependencies. To address this, we propose Temporally Heterogeneous Graph-based Contrastive Learning (THGCL). Our framework constructs a temporal graph for each event, where audio and video segments form nodes and their temporal links form edges. We introduce Gaussian processes for intra-modal smoothness, Hawkes processes for inter-modal decay, and contrastive learning to capture fine-grained relationships. Experiments on AudioSet show that THGCL achieves state-of-the-art performance.
The main contribution of this paper is the development of a novel framework, THGCL, that effectively addresses the challenges of temporal alignment and noise reduction in multimodal acoustic event classification through a sophisticated graph-based approach. This work represents a meaningful advancement in the field of audio-visual machine learning, demonstrating both theoretical innovation and practical applicability.
The proposed Temporally Heterogeneous Graph-based Contrastive Learning (THGCL) framework innovatively constructs a temporal heterogeneous graph to model both intra- and inter-modal dependencies in acoustic event classification. The integration of Gaussian processes and Hawkes processes to manage temporal relationships is a significant methodological advancement. The contrastive learning component effectively enhances the robustness of the model against noise, showcasing a thoughtful design that addresses key challenges in multimodal learning.
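The two temporal priors can be pictured as simple edge-weight functions: a Gaussian kernel for smoothness within a modality and an exponentially decaying Hawkes-style excitation across modalities. The sketch below shows one illustrative parameterisation; the kernel width and decay rates are placeholders, not values from the paper.

```python
import numpy as np

def intra_modal_weight(dt: float, sigma: float = 1.0) -> float:
    """Gaussian-kernel smoothness weight between two segments of the SAME
    modality separated by dt seconds (illustrative form; sigma is a placeholder)."""
    return float(np.exp(-(dt ** 2) / (2.0 * sigma ** 2)))

def inter_modal_weight(dt: float, alpha: float = 1.0, beta: float = 0.5) -> float:
    """Hawkes-style excitation weight from one modality to the other: the
    influence decays exponentially with the (non-negative) time lag dt."""
    return float(alpha * np.exp(-beta * max(dt, 0.0)))

# Example: audio and video segments 0.5 s apart.
print(intra_modal_weight(0.5), inter_modal_weight(0.5))
```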
The experiments conducted on the AudioSet dataset demonstrate a thorough evaluation of the proposed method against existing state-of-the-art approaches. The use of mean average precision (mAP) and area under the ROC curve (AUC) as evaluation metrics is appropriate, and the results indicate a clear performance advantage of THGCL. However, further comparisons with more diverse datasets could strengthen the validation of the method's generalizability.
The paper provides sufficient implementation details, including the architecture of the Temporal Heterogeneous Graph Network (THGN), hyperparameters, and training procedures, which are essential for reproducibility. The availability of the code repository on GitHub further supports this aspect.
While the proposed method shows promise, it may be limited by its reliance on the quality of the input features from the audio and video modalities. Additionally, the complexity of the model could pose challenges in real-time applications where computational efficiency is critical. The paper could also benefit from a more extensive discussion on potential biases in the dataset used.
The advancements in multimodal acoustic event classification have significant implications for various applications, including surveillance systems, smart environments, and human-computer interaction. By improving the robustness of audio-visual systems, this research could enhance the reliability of automated systems in real-world scenarios. The main contribution of this paper is the development of a novel framework, THGCL, that effectively addresses the challenges of temporal alignment and noise reduction in multimodal acoustic event classification through a sophisticated graph-based approach. This work represents a meaningful advancement in the field of audio-visual machine learning, demonstrating both theoretical innovation and practical applicability.
Current audio captioning systems rely heavily on supervised learning with paired audio-caption datasets, which are expensive to curate and may not reflect human preferences in real-world scenarios. To address this limitation, we propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback (RLHF). To effectively capture nuanced human preferences, we train a Contrastive Language-Audio Pretraining (CLAP)-based reward model using human-labeled pairwise preference data. This reward model is integrated into a reinforcement learning framework to fine-tune any baseline captioning system without relying on ground-truth caption annotations. Extensive human evaluations across multiple datasets show that our method produces captions preferred over those from baseline models, particularly in cases where the baseline models fail to provide correct and natural captions. Furthermore, our framework achieves performance comparable to supervised approaches with ground-truth data, demonstrating its effectiveness in aligning audio captioning with human preferences and its scalability in real-world scenarios.
The main contribution of this paper is the introduction of a novel RLHF framework for audio captioning that effectively aligns model outputs with human preferences, demonstrating competitive performance without the need for ground-truth captions. This work significantly advances the field of audio captioning by addressing the limitations of existing methods and providing a scalable solution for real-world applications.
The proposed methodology is innovative in its use of Reinforcement Learning from Human Feedback (RLHF) to align audio captions with human preferences without requiring paired audio-caption datasets. The integration of a Contrastive Language-Audio Pretraining (CLAP)-based reward model trained on pairwise human preference data is a significant advancement over traditional supervised learning approaches. The paper effectively addresses the challenges of audio captioning, particularly the ambiguity and temporal complexity inherent in audio data, by focusing on human alignment rather than static similarity metrics. The use of reward shaping techniques to mitigate reward hacking is a thoughtful addition that enhances the robustness of the approach.
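Pairwise preference data of this kind is typically fit with a Bradley-Terry style objective, shown below for reference; the paper's additional reward-shaping terms are not reproduced in this sketch.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry objective for reward-model training: maximise the
    probability that the human-preferred caption scores higher than the
    rejected one.

    reward_preferred, reward_rejected: (B,) scalar scores from the CLAP-based
    reward model for the chosen / rejected caption of each audio clip.
    """
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()
```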
The experiments are comprehensive, utilizing both public and proprietary datasets to evaluate the performance of the proposed system. The results demonstrate that the RLHF-based method consistently outperforms baseline models, particularly in challenging scenarios where traditional models fail. The human evaluations provide strong evidence of the method's effectiveness in producing captions that are preferred by annotators, and the comparative analysis with supervised approaches highlights the scalability and cost-effectiveness of the proposed framework. However, the reliance on limited preference data for training the reward model may affect the generalizability of the results.
The paper provides detailed implementation details, including model architecture, training procedures, and hyperparameter settings, which enhances reproducibility. However, the absence of publicly available code or datasets limits the ability for others to fully replicate the experiments. The authors should consider releasing their code and datasets to facilitate further research and validation.
One notable limitation is the reliance on pairwise preference data, which may not capture the full spectrum of human judgment. Additionally, the method's performance may vary significantly depending on the quality and quantity of the preference data used for training the reward model. The authors acknowledge the potential for reward hacking and the challenges associated with human evaluation, which can introduce variability in the results.
The proposed framework has significant potential applications in various domains, including assistive technologies for the hearing impaired, content creation for multimedia platforms, and enhancing user experiences in audio-based applications. By aligning audio captioning systems more closely with human preferences, the research could lead to more intuitive and effective interactions with audio content, ultimately benefiting a wide range of users. The main contribution of this paper is the introduction of a novel RLHF framework for audio captioning that effectively aligns model outputs with human preferences, demonstrating competitive performance without the need for ground-truth captions. This work significantly advances the field of audio captioning by addressing the limitations of existing methods and providing a scalable solution for real-world applications.
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation model trained on 13,000 hours of multilingual child-centered long-form recordings spanning over 40 languages. We evaluate BabyHuBERT on speaker segmentation, identifying when target children speak versus female adults, male adults, or other children -- a fundamental preprocessing step for analyzing naturalistic language experiences. BabyHuBERT achieves F1-scores from 52.1% to 74.4% across six diverse datasets, consistently outperforming W2V2-LL4300 (trained on English long-forms) and standard HuBERT (trained on clean adult speech). Notable improvements include 13.2 absolute F1 points over HuBERT on Vanuatu and 15.9 points on Solomon Islands corpora, demonstrating effectiveness on underrepresented languages. By sharing code and models, BabyHuBERT serves as a foundation model for child speech research, enabling fine-tuning on diverse downstream tasks.
BabyHuBERT introduces a pioneering self-supervised speech representation model tailored for multilingual child-centered recordings, demonstrating substantial improvements in speaker segmentation tasks. The comprehensive methodology and significant technical contributions position this work as a valuable asset for advancing research in child speech processing and language development.
The methodology proposed in BabyHuBERT is robust and innovative, leveraging a large-scale multilingual dataset specifically tailored for child-centered recordings. The adoption of HuBERT's masked prediction approach is well-justified, considering the inherent noise in child-centered audio. The two-iteration pre-training strategy, utilizing features from different layers of WavLM, demonstrates a thoughtful adaptation of existing models to the unique challenges of the task. The fine-tuning strategy is also comprehensive, employing both frozen feature extraction and full fine-tuning to evaluate the model's performance effectively.
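For readers unfamiliar with the underlying objective, a HuBERT-style masked prediction loss computes cross-entropy against discrete pseudo-labels only at masked frames, as in the generic sketch below; BabyHuBERT's specific masking ratio and label source follow the paper rather than this illustration.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(frame_logits, pseudo_labels, mask):
    """HuBERT-style objective: predict discrete pseudo-labels (cluster ids of
    teacher features) only at masked frames.

    frame_logits: (B, T, K) frame-level class logits
    pseudo_labels: (B, T) int64 cluster ids
    mask: (B, T) bool, True where the input was masked
    """
    logits = frame_logits[mask]        # (N_masked, K)
    targets = pseudo_labels[mask]      # (N_masked,)
    return F.cross_entropy(logits, targets)
```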
The experiments conducted are thorough, with a clear focus on speaker segmentation as the primary task. The paper presents a well-structured evaluation across multiple datasets, showcasing substantial improvements over existing models like W2V2-LL4300 and standard HuBERT. The reported F1-scores highlight the model's effectiveness, particularly in underrepresented languages, which is a significant contribution to the field. However, the paper could benefit from additional details on the datasets used for evaluation and comparisons against more diverse baselines.
The paper provides adequate implementation details, including training procedures, hyperparameters, and dataset partitioning strategies. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should prioritize sharing code and trained models to facilitate further research and application of BabyHuBERT.
The paper acknowledges limitations related to the computational expense of self-supervised pre-training, which restricts the exploration of hyperparameter configurations. Additionally, the performance on underrepresented classes suggests that further improvements are necessary. The reliance on human annotators for topline comparisons introduces variability that could affect the interpretation of results.
BabyHuBERT has the potential to significantly advance research in child language acquisition by providing a robust tool for analyzing naturalistic language experiences. Its multilingual focus addresses a critical gap in existing speech models, promoting inclusivity in language development research. The model's ability to improve speaker segmentation in diverse acoustic environments can facilitate better understanding of child interactions and language learning processes. BabyHuBERT introduces a pioneering self-supervised speech representation model tailored for multilingual child-centered recordings, demonstrating substantial improvements in speaker segmentation tasks. The comprehensive methodology and significant technical contributions position this work as a valuable asset for advancing research in child speech processing and language development.
In this paper, we present state-of-the-art diarization error rates (DERs) on multiple publicly available datasets, including AliMeeting-far, AliMeeting-near, AMI-Mix, AMI-SDM, DIHARD III, and MagicData RAMC. Leveraging EEND-TA, a single unified non-autoregressive model for end-to-end speaker diarization, we achieve new benchmark results, most notably a DER of 14.49% on DIHARD III. Our approach scales pretraining through 8-speaker simulation mixtures, ensuring each generated speaker mixture configuration is sufficiently represented. These experiments highlight that EEND-based architectures possess a greater capacity for learning than previously explored, surpassing many existing diarization solutions while maintaining efficient speeds during inference.
The paper makes a substantial contribution to the field of speaker diarization by presenting a state-of-the-art model that effectively leverages large-scale pre-training and demonstrates competitive performance across multiple datasets. The methodology is robust, though the lack of reproducibility resources and some limitations in performance on specific datasets suggest areas for future improvement.
The paper presents a novel approach to speaker diarization using the EEND-TA model, which is a unified non-autoregressive architecture. The methodology is well-structured, leveraging a combination of Conformer encoders and Transformer decoders, and introduces a significant innovation in scaling pre-training with simulated mixtures of up to 8 speakers. This addresses the challenge of limited annotated datasets in diarization tasks. The authors provide a clear explanation of their model architecture and the rationale behind their design choices, which enhances the understanding of their contributions.
The experiments are comprehensive, covering multiple publicly available datasets and demonstrating state-of-the-art performance in terms of Diarization Error Rates (DER). The authors effectively compare their results against existing methods, showcasing improvements across various configurations. The use of a large-scale pre-training dataset (over 80,000 hours) is particularly noteworthy, as it demonstrates the model's capacity to learn effectively from diverse speaker configurations. However, the paper could benefit from more detailed discussions on the experimental setup and the specific conditions under which the results were obtained.
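The headline metric can be stated compactly: DER sums missed speech, false-alarm speech, and speaker-confusion time and normalises by the total reference speech, as in the small helper below.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Diarization error rate: missed speech + false-alarm speech +
    speaker-confusion time, divided by total reference speech (all in seconds)."""
    return (missed + false_alarm + confusion) / total_speech
```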
While the paper includes sufficient details regarding the model architecture and training procedures, it lacks explicit links to code repositories or supplementary materials that would facilitate reproduction of the results. The absence of a demo or project URL is a significant limitation for reproducibility, as other researchers may find it challenging to replicate the experiments without access to the code or datasets used.
The paper acknowledges that the model does not outperform existing state-of-the-art results on all datasets, particularly AISHELL-4, CALLHOME, and VoxConverse. This limitation highlights the need for further refinement and tuning of the model for specific datasets. Additionally, the reliance on simulated mixtures may not fully capture the complexities of real-world audio recordings, which could affect generalization.
The findings of this research have significant implications for real-time applications in speech processing, such as automated transcription services, video conferencing, and customer service systems. By improving the efficiency and accuracy of speaker diarization, this work can enhance user experiences in various audio-based applications. The emphasis on end-to-end models also aligns with trends in machine learning towards more integrated and efficient solutions. The paper makes a substantial contribution to the field of speaker diarization by presenting a state-of-the-art model that effectively leverages large-scale pre-training and demonstrates competitive performance across multiple datasets. The methodology is robust, though the lack of reproducibility resources and some limitations in performance on specific datasets suggest areas for future improvement.
In this paper, we show that discrete optimal transport (DOT) is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Evaluated on ASVspoof2019 and ASVspoof5 with AASIST baselines, DOT yields consistently high equal error rate (EER) across datasets and remains competitive after CM fine-tuning, outperforming several conventional attacks in cross-dataset transfer. Ablation analysis highlights the practical impact of vocoder overlap. Results indicate that distribution-level alignment is a powerful and stable attack surface for deployed CMs.
Primary: University of Rochester
All Institutions: University of Rochester
The main contribution of this paper is the introduction of discrete optimal transport as a powerful method for generating adversarial audio attacks against anti-spoofing systems, demonstrating both theoretical and practical advancements in the field. The comprehensive analysis of the methodology and results highlights the significance of distribution-level alignment in enhancing the effectiveness of audio adversarial attacks.
The paper introduces a novel approach using discrete optimal transport (DOT) as a black-box adversarial attack against audio anti-spoofing countermeasures, which is a significant advancement in the field of audio security. The methodology is well-structured, detailing the process of aligning frame-level WavLM embeddings to a bona fide pool using entropic OT and a top-$k$ barycentric projection. The use of a neural vocoder for waveform reconstruction is appropriate and adds to the realism of the generated audio. The authors provide a clear theoretical foundation for their approach, although the paper could benefit from a more detailed explanation of the entropic regularization and its implications.
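The distribution-alignment step can be sketched as entropic (Sinkhorn) transport from generated-frame embeddings to a bona fide pool, followed by a top-k barycentric projection, as below; the regularisation strength, iteration count, and k are illustrative placeholders rather than the paper's settings.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.05, n_iter=200):
    """Entropic optimal transport between histograms a (n,) and b (m,)
    with cost matrix (n, m); returns the (n, m) transport plan."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    return u[:, None] * K * v[None, :]

def topk_barycentric_projection(plan, pool, k=8):
    """Map each source frame to a weighted average of its k strongest transport
    targets in the bona fide embedding pool.

    plan: (N_src, N_pool) transport plan; pool: (N_pool, D) bona fide embeddings.
    """
    out = np.zeros((plan.shape[0], pool.shape[1]))
    for i, row in enumerate(plan):
        idx = np.argsort(row)[-k:]                     # top-k targets for this frame
        w = row[idx] / (row[idx].sum() + 1e-12)
        out[i] = w @ pool[idx]
    return out
```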
The experiments are robust, utilizing established datasets (ASVspoof2019 and ASVspoof5) and employing AASIST baselines for evaluation. The results demonstrate that the DOT attack consistently achieves high equal error rates (EER), indicating its effectiveness across different datasets and after countermeasure fine-tuning. The ablation analysis regarding vocoder overlap is particularly insightful, showcasing the practical implications of the attack. However, the paper lacks a comprehensive comparison with more recent adversarial attack methodologies, which could further contextualize its contributions.
While the paper outlines the experimental setup and methodologies, it does not provide sufficient details for full reproducibility. Key parameters, such as the specific configurations for the neural vocoder and the exact implementation of the DOT algorithm, are not fully disclosed. Additionally, the absence of a code repository or demonstration URL limits the ability for other researchers to replicate the findings.
The primary limitation of the study is its reliance on specific datasets, which may not generalize to all audio environments or countermeasures. The effectiveness of the DOT attack may vary with different types of audio data or countermeasures not covered in the experiments. Furthermore, the paper does not address potential defenses against the proposed attack, which is critical for understanding its practical implications.
The findings of this research have significant implications for the field of audio security, particularly in enhancing the robustness of anti-spoofing systems. The methodology could be applied to improve the security of voice recognition systems in various applications, including banking, personal assistants, and security systems. However, the potential for misuse in creating more sophisticated adversarial attacks raises ethical considerations that need to be addressed. The main contribution of this paper is the introduction of discrete optimal transport as a powerful method for generating adversarial audio attacks against anti-spoofing systems, demonstrating both theoretical and practical advancements in the field. The comprehensive analysis of the methodology and results highlights the significance of distribution-level alignment in enhancing the effectiveness of audio adversarial attacks.
Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed due to the complexity and cost of overnight polysomnography. Acoustic-based screening provides a scalable alternative, yet performance is limited by environmental noise and the lack of physiological context. Respiratory effort is a key signal used in clinical scoring of OSA events, but current approaches require additional contact sensors that reduce scalability and patient comfort. This paper presents the first study to estimate respiratory effort directly from nocturnal audio, enabling physiological context to be recovered from sound alone. We propose a latent-space fusion framework that integrates the estimated effort embeddings with acoustic features for OSA detection. Using a dataset of 157 nights from 103 participants recorded in home environments, our respiratory effort estimator achieves a concordance correlation coefficient of 0.48, capturing meaningful respiratory dynamics. Fusing effort and audio improves sensitivity and AUC over audio-only baselines, especially at low apnoea-hypopnoea index thresholds. The proposed approach requires only smartphone audio at test time, which enables sensor-free, scalable, and longitudinal OSA monitoring.
Primary: University of Sheffield
All Institutions: University of Sheffield (School of Computer Science), Passion for Life Healthcare (UK) Ltd
This paper presents a novel approach to estimating respiratory effort from nocturnal audio, significantly advancing the field of acoustic-based sleep apnoea detection. The integration of respiratory dynamics into OSA screening offers a promising direction for non-invasive monitoring, though challenges remain in reproducibility and model performance in noisy environments.
The proposed methodology introduces a latent-space fusion framework that innovatively combines respiratory effort embeddings inferred from nocturnal audio with acoustic features for obstructive sleep apnoea (OSA) detection. This approach is commendable as it addresses the limitations of existing methods that rely on additional sensors, thus enhancing scalability and patient comfort. The use of a CNN-LSTM architecture to extract features from audio signals is appropriate, and the decision to use the concordance correlation coefficient (CCC) as the optimization objective is well-justified, given its sensitivity to both correlation and bias. However, the methodology could benefit from more detailed explanations of the model training process and hyperparameter tuning.
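Since the effort estimator is trained and evaluated with the concordance correlation coefficient, its standard definition is reproduced below for reference.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Concordance correlation coefficient (CCC): penalises both low
    correlation and systematic bias between the predicted and reference
    respiratory-effort signals. Standard definition, shown for reference."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```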
The experiments are robust, utilizing a dataset of 157 nights from 103 participants, which is a significant sample size for a study of this nature. The results demonstrate that the respiratory effort estimator achieves a CCC of 0.478, indicating a meaningful relationship between audio and respiratory dynamics. The performance metrics for OSA severity classification show that the proposed model outperforms audio-only baselines, particularly at lower AHI thresholds, which is clinically relevant. However, the paper could improve by providing more comprehensive comparisons with existing state-of-the-art methods and discussing the implications of the performance metrics in a clinical context.
The paper lacks sufficient details regarding the implementation and code availability, which are crucial for reproducibility. While it describes the model architecture and training procedures, it does not provide information on the specific datasets used for training and validation splits, nor does it mention whether the code or trained models will be made publicly available. This limits the ability of other researchers to replicate the study.
Several limitations are noted, including the challenges posed by environmental noise and the variability of smartphone recordings, which can impact the accuracy of the respiratory effort predictions. Additionally, the temporal misalignment between audio and respiratory signals may lead to lower correlation values. The CCC of 0.478, while indicative of some predictive capability, suggests that the model may still struggle with certain segments of audio. The paper also does not address the potential for overfitting given the relatively small dataset size compared to the complexity of the model.
The implications of this research are significant, as it presents a non-invasive, scalable method for OSA screening that could improve early detection and management of the condition. The ability to monitor respiratory effort using only smartphone audio could lead to widespread adoption in home settings, reducing the burden on healthcare systems and improving patient outcomes. Future work could explore further enhancements to the model and its application in diverse populations. This paper presents a novel approach to estimating respiratory effort from nocturnal audio, significantly advancing the field of acoustic-based sleep apnoea detection. The integration of respiratory dynamics into OSA screening offers a promising direction for non-invasive monitoring, though challenges remain in reproducibility and model performance in noisy environments.
The majority of mainstream neural vocoders primarily focus on speech quality and generation speed, while overlooking latency, which is a critical factor in real-time applications. Excessive latency leads to noticeable delays in user interaction, severely degrading the user experience and rendering such systems impractical for real-time use. Therefore, this paper proposes DLL-APNet, a Distilled Low-Latency neural vocoder that first predicts the Amplitude and Phase spectra explicitly from the input mel spectrogram and then reconstructs the speech waveform via inverse short-time Fourier transform (iSTFT). The DLL-APNet vocoder leverages causal convolutions to constrain the utilization of information to current and historical contexts, effectively minimizing latency. To mitigate speech quality degradation caused by causal constraints, a knowledge distillation strategy is proposed, where a pre-trained non-causal teacher vocoder guides intermediate feature generation of the causal student DLL-APNet vocoder. Experimental results demonstrate that the proposed DLL-APNet vocoder produces higher-quality speech than other causal vocoders, while requiring fewer computational resources. Furthermore, the proposed DLL-APNet vocoder achieves speech quality on par with mainstream non-causal neural vocoders, validating its ability to deliver both high perceptual quality and low latency.
Primary: University of Science and Technology of China
All Institutions: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China
The paper presents DLL-APNet, a novel low-latency neural vocoder that effectively balances speech quality and latency through innovative use of causal convolutions and knowledge distillation. The methodology and experimental results indicate a meaningful contribution to the field of speech synthesis, particularly for applications requiring real-time performance.
The proposed methodology of DLL-APNet is well-structured, leveraging causal convolutions to minimize latency while employing knowledge distillation to enhance speech quality. The explicit prediction of amplitude and phase spectra from mel spectrograms is a significant improvement over traditional vocoders that often neglect latency. The integration of a pre-trained non-causal model as a teacher for the student model is innovative and effectively addresses the trade-off between latency and speech quality. The use of causal convolutions is appropriate for real-time applications, and the paper provides a clear explanation of how these convolutions operate to maintain causality.
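The two core ingredients, causality enforced by left-only padding and feature-level distillation from a non-causal teacher, can be sketched as follows. This is a minimal illustration assuming PyTorch; layer choices and loss weighting are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only sees current and past frames (left padding only)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))   # pad on the left so no future frames leak in
        return self.conv(x)

def feature_distillation_loss(student_feats, teacher_feats):
    """L1 match between intermediate features of the causal student and the frozen
    non-causal teacher; which layers to match and how to weight them are assumptions."""
    return sum(F.l1_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))
```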
The experimental setup is robust, utilizing the VCTK dataset, which is a standard benchmark for speech synthesis tasks. The authors compare their model against several state-of-the-art vocoders, both causal and non-causal, providing a comprehensive analysis of performance metrics. The results demonstrate that DLL-APNet outperforms other causal vocoders while maintaining quality comparable to non-causal models. The use of multiple objective metrics (SNR, RMSE, MCD, etc.) adds credibility to their findings. However, the paper could benefit from more qualitative evaluations, such as user studies or perceptual tests, to complement the objective metrics.
The paper includes sufficient implementation details, such as hyperparameter settings, model architecture, and training procedures, which facilitate reproducibility. The authors also mention the use of a demo page for generated speech samples, enhancing transparency. However, the lack of a publicly available code repository limits the ease of reproduction for other researchers.
One limitation is the reliance on a pre-trained non-causal model, which may not be readily available to all researchers. Additionally, while the paper addresses latency, it does not explore the potential trade-offs in terms of model size and complexity, which could impact deployment in resource-constrained environments. The paper also lacks a discussion on the generalizability of the model to different languages or accents, which could be an important consideration for real-world applications.
The proposed DLL-APNet vocoder has significant implications for real-time speech applications, such as telecommunication, virtual assistants, and interactive voice response systems. By addressing the critical issue of latency while maintaining high speech quality, this work contributes to the advancement of practical speech synthesis technologies. The findings could influence future research directions in low-latency vocoding and real-time audio processing.
Singing Accompaniment Generation (SAG) is the process of generating instrumental music for a given clean vocal input. However, existing SAG techniques use source-separated vocals as input and overfit to separation artifacts. This creates a critical train-test mismatch, leading to failure on clean, real-world vocal inputs. We introduce AnyAccomp, a framework that resolves this by decoupling accompaniment generation from source-dependent artifacts. AnyAccomp first employs a quantized melodic bottleneck, using a chromagram and a VQ-VAE to extract a discrete and timbre-invariant representation of the core melody. A subsequent flow-matching model then generates the accompaniment conditioned on these robust codes. Experiments show AnyAccomp achieves competitive performance on separated-vocal benchmarks while significantly outperforming baselines on generalization test sets of clean studio vocals and, notably, solo instrumental tracks. This demonstrates a qualitative leap in generalization, enabling robust accompaniment for instruments - a task where existing models completely fail - and paving the way for more versatile music co-creation tools. Demo audio and code: https://anyaccomp.github.io
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of AnyAccomp, a framework that effectively resolves the train-test mismatch in singing accompaniment generation by utilizing a quantized melodic bottleneck to enhance generalization capabilities. This work represents a significant advancement in the field of audio machine learning, addressing critical challenges and paving the way for more versatile music co-creation tools.
The paper introduces a novel two-stage framework, AnyAccomp, which effectively decouples accompaniment generation from source-dependent artifacts through a quantized melodic bottleneck using VQ-VAE and a flow-matching model. This approach is innovative as it addresses the critical train-test mismatch prevalent in existing SAG models, which typically overfit to artifacts from source-separated vocals. The use of a chromagram for timbre-invariant representation is a significant methodological advancement, allowing the model to focus on the core melody rather than irrelevant acoustic details. The combination of a robust representation and a flow-matching transformer for accompaniment generation is well-conceived and demonstrates a clear understanding of the challenges in the field.
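As a rough illustration of the melodic bottleneck idea, the sketch below extracts a chromagram with librosa and quantizes each frame against a codebook by nearest-neighbour lookup. In the actual system the codebook is learned inside a VQ-VAE rather than applied directly to raw chroma, so the function and its arguments should be read as assumptions.

```python
import librosa
import torch

def melodic_codes(wav_path: str, codebook: torch.Tensor, sr: int = 24000) -> torch.Tensor:
    """Chromagram -> nearest-codebook-entry indices (illustrative only)."""
    y, _ = librosa.load(wav_path, sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # (12, frames): timbre-invariant pitch-class energy
    frames = torch.from_numpy(chroma.T).float()        # (frames, 12)
    # nearest-neighbour quantization against a codebook of shape (K, 12)
    dists = torch.cdist(frames, codebook)              # (frames, K)
    return dists.argmin(dim=-1)                        # discrete melody codes
```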
The experiments are comprehensive, utilizing a substantial dataset of 8k hours of paired singing voice and accompaniment data. The evaluation metrics are well-defined, including both objective measures (FAD, APA) and subjective assessments (MOS tests), which provide a holistic view of the model's performance. The results convincingly demonstrate that AnyAccomp outperforms existing models, particularly in generalization to clean vocals and instrumental tracks, validating the effectiveness of the proposed methodology. However, the paper could benefit from more detailed comparisons with additional baseline models to further substantiate its claims.
The implementation details are clear, including model architectures, training parameters, and data preparation processes, which facilitate reproducibility. The authors provide sufficient information about the training setup, including the number of parameters and optimization strategies. However, the absence of a publicly available code repository limits the ease of reproduction for other researchers, which is a significant consideration in machine learning research.
While the proposed method shows promising results, the paper does not address potential limitations in terms of computational efficiency and scalability. The reliance on a large dataset for training may also pose challenges for users with limited resources. Additionally, the model's performance on highly diverse or unconventional vocal inputs has not been thoroughly tested, which could affect its applicability in real-world scenarios.
The framework has the potential to significantly enhance music creation tools, enabling artists and producers to generate high-quality instrumental accompaniments from vocal inputs. This could democratize music production, making it more accessible to amateurs and non-professionals. Furthermore, the advancements in generalization could lead to more robust AI systems in creative fields, fostering innovation in music technology and related domains.
Stuttered and dysfluent speech detection systems have traditionally suffered from the trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper examines the Unconstrained Dysfluency Modeling (UDM) series, the current state-of-the-art framework developed at Berkeley, which combines modular architecture, explicit phoneme alignment, and interpretable outputs for real-world clinical deployment. Through extensive experiments involving patients and certified speech-language pathologists (SLPs), we demonstrate that UDM achieves state-of-the-art performance (F1: 0.89 ± 0.04) while providing clinically meaningful interpretability scores (4.2/5.0). Our deployment study shows an 87% clinician acceptance rate and a 34% reduction in diagnostic time. The results provide strong evidence that UDM represents a practical pathway toward AI-assisted speech therapy in clinical environments.
All Institutions: SSHealth Team, AI for Healthcare Laboratory
This paper presents a comprehensive evaluation of the UDM framework, demonstrating its effectiveness in clinical dysfluency detection while addressing the critical need for interpretability in AI applications within healthcare. The methodology and results contribute meaningfully to the field of machine learning in audio processing, particularly in enhancing the clinical utility of dysfluency detection systems.
The paper introduces the Unconstrained Dysfluency Modeling (UDM) framework, which is a modular and interpretable architecture designed to address the limitations of traditional dysfluency detection systems. The methodology is well-structured, incorporating multi-scale feature extraction, phoneme alignment, and explicit classification of dysfluency types. The explicit phoneme alignment module is a significant innovation that enhances interpretability, allowing clinicians to understand the model's decisions. However, the paper could benefit from a more detailed explanation of the equations and algorithms used, as well as the specific training and validation processes.
The experiments are robust, utilizing a large dataset from a clinical setting, which adds to the real-world applicability of the findings. The paper compares UDM against several baseline models, demonstrating superior performance across various metrics, including F1-score and interpretability scores. The inclusion of clinician feedback and acceptance rates provides valuable insights into the practical implications of the model. However, the results could be strengthened by including more diverse datasets and additional clinical settings to validate the model's generalizability.
The paper lacks explicit details regarding the implementation of the UDM framework, such as code availability or links to a repository. This absence limits the reproducibility of the results. Providing access to the model and datasets would enhance the credibility and usability of the research.
The paper acknowledges several limitations, including challenges with silent blocks, the current focus on Mandarin Chinese speakers, and the need for further validation in longitudinal studies. Additionally, the model's performance on different dysfluency types varies, indicating areas for improvement.
The UDM framework has significant potential for improving clinical practices in speech therapy by providing interpretable and accurate dysfluency detection. Its modular design could facilitate broader applications in other areas of healthcare where interpretability is crucial. The findings may also influence future research directions in AI-assisted speech therapy, particularly in under-resourced clinical environments.
We use the term re-identification to refer to the process of recovering the original speaker's identity from anonymized speech outputs. Speaker de-identification systems aim to reduce the risk of re-identification, but most evaluations focus only on individual-level measures and overlook broader risks from soft biometric leakage. We introduce the Soft Biometric Leakage Score (SBLS), a unified method that quantifies resistance to zero-shot inference attacks on non-unique traits such as channel type, age range, dialect, sex of the speaker, or speaking style. SBLS integrates three elements: direct attribute inference using pre-trained classifiers, linkage detection via mutual information analysis, and subgroup robustness across intersecting attributes. Applying SBLS with publicly available classifiers, we show that all five evaluated de-identification systems exhibit significant vulnerabilities. Our results indicate that adversaries using only pre-trained models - without access to original speech or system details - can still reliably recover soft biometric information from anonymized output, exposing fundamental weaknesses that standard distributional metrics fail to capture.
Primary: National Institute of Standards and Technology
All Institutions: National Institute of Standards and Technology
The paper makes a significant contribution by introducing a novel metric for assessing soft biometric leakage in speaker de-identification systems, revealing vulnerabilities that traditional metrics overlook. The comprehensive methodology and experimental evaluation underscore the importance of addressing privacy concerns in speech processing, although further work is needed to enhance reproducibility and generalizability.
The paper introduces the Soft Biometric Leakage Score (SBLS), a novel metric that integrates three components: zero-shot attribute inference, systematic linkage detection, and subgroup robustness. This comprehensive approach addresses a significant gap in the evaluation of speaker de-identification systems by focusing on soft biometric leakage rather than traditional metrics. The methodology is well-structured and employs established statistical techniques, such as mutual information analysis, to quantify vulnerabilities effectively. However, the choice of heuristic weights in the SBLS calculation could benefit from further justification or exploration of alternative weighting strategies.
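A toy aggregation of the three ingredients might look like the following; the component weights, the 0.5 decision threshold, and the normalizations are illustrative assumptions and not the values used in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mutual_info_score

def soft_biometric_leakage(y_true, scores, groups, weights=(0.4, 0.3, 0.3)):
    """Toy combination of the three SBLS ingredients; weights and scaling are assumptions."""
    y_true, scores, groups = map(np.asarray, (y_true, scores, groups))
    # 1) zero-shot attribute inference: how well a pre-trained classifier recovers the trait
    inference = roc_auc_score(y_true, scores)
    # 2) linkage: mutual information between hard predictions and the true attribute
    linkage = mutual_info_score(y_true, (scores > 0.5).astype(int))
    # 3) subgroup robustness: worst-case leakage across intersecting subgroups
    worst = max(
        roc_auc_score(y_true[groups == g], scores[groups == g])
        for g in np.unique(groups)
        if len(np.unique(y_true[groups == g])) > 1
    )
    return weights[0] * inference + weights[1] * linkage + weights[2] * worst
```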
The experimental setup is robust, utilizing a diverse dataset (Mixer 3 corpus) with rich demographic annotations to evaluate five different speaker de-identification systems. The results demonstrate significant vulnerabilities in all evaluated systems, highlighting the effectiveness of the SBLS in revealing soft biometric leakage. The paper provides clear performance metrics, including AUC values and subgroup-specific leakage scores, which enhance the interpretability of the findings. However, the reliance on a single dataset may limit the generalizability of the results.
While the paper outlines the methodology and experimental setup in detail, it lacks specific implementation details or code availability, which could hinder reproducibility. The absence of a public repository for the SBLS implementation or the evaluated systems means that other researchers cannot easily replicate the experiments or build upon the findings.
The primary limitation is the focus on a single dataset, which may not capture the full spectrum of speaker characteristics and de-identification challenges. Additionally, the paper does not address potential variations in performance across different languages or dialects, which could affect the applicability of the findings. The heuristic nature of the SBLS component weights also raises questions about the robustness of the results.
This research has significant implications for privacy and security in speech processing applications, particularly in contexts where speaker anonymity is crucial. By quantifying soft biometric leakage, the findings can inform the development of more robust speaker de-identification systems and contribute to the broader discourse on privacy-preserving technologies in machine learning. The introduction of SBLS could lead to improved standards for evaluating de-identification systems, ultimately enhancing user trust and safety in voice-based applications.
This paper proposes APSS, a novel neural speech separation model with parallel amplitude and phase spectrum estimation. Unlike most existing speech separation methods, the APSS distinguishes itself by explicitly estimating the phase spectrum for more complete and accurate separation. Specifically, APSS first extracts the amplitude and phase spectra from the mixed speech signal. Subsequently, the extracted amplitude and phase spectra are fused by a feature combiner into joint representations, which are then further processed by a deep processor with time-frequency Transformers to capture temporal and spectral dependencies. Finally, leveraging parallel amplitude and phase separators, the APSS estimates the respective spectra for each speaker from the resulting features, which are then combined via inverse short-time Fourier transform (iSTFT) to reconstruct the separated speech signals. Experimental results indicate that APSS surpasses both time-domain separation methods and implicit-phase-estimation-based time-frequency approaches. Also, APSS achieves stable and competitive results on multiple datasets, highlighting its strong generalization capability and practical applicability.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing
The main contribution of this paper is the introduction of the APSS model, which explicitly estimates both amplitude and phase spectra for improved speech separation. This innovative approach, combined with rigorous experimental validation, positions the work as a significant advancement in the audio processing domain, addressing a critical challenge in speech separation tasks.
The proposed APSS model introduces a novel approach to speech separation by explicitly modeling both amplitude and phase spectra, which is a significant advancement over existing methods that often neglect phase information. The architecture is well-structured, utilizing a feature combiner and deep processors with time-frequency Transformers, which effectively captures the temporal and spectral dependencies. The parallel amplitude and phase separators are a clever design choice that allows for independent estimation while still leveraging the correlation between amplitude and phase. This dual modeling is a notable methodological contribution, addressing a critical gap in the field.
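The explicit amplitude-and-phase formulation is straightforward to state in code: extract both spectra from the mixture, and recombine the per-speaker estimates with iSTFT. The sketch below assumes PyTorch and arbitrary STFT settings; it shows only the signal path, not the separator networks.

```python
import torch

def split_amplitude_phase(mixture: torch.Tensor, n_fft: int = 512, hop: int = 128):
    """STFT of the mixture -> (amplitude, phase), each of shape (batch, freq, frames)."""
    window = torch.hann_window(n_fft, device=mixture.device)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs(), spec.angle()

def reconstruct(amplitude: torch.Tensor, phase: torch.Tensor,
                n_fft: int = 512, hop: int = 128, length: int = None):
    """Recombine a speaker's estimated amplitude and phase and invert with iSTFT."""
    window = torch.hann_window(n_fft, device=amplitude.device)
    spec = torch.polar(amplitude, phase)        # amplitude * exp(i * phase)
    return torch.istft(spec, n_fft, hop_length=hop, window=window, length=length)
```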
The experimental setup is robust, utilizing well-known datasets (WSJ0-2Mix and Libri2Mix) to validate the model's performance. The results demonstrate that APSS outperforms various baseline models, including both time-domain and implicit-phase estimation methods. The use of ablation studies to assess the contributions of different components of the model adds rigor to the evaluation, providing clear evidence of the importance of each part of the architecture. However, the paper could benefit from more detailed comparisons with additional state-of-the-art methods to further contextualize its contributions.
The paper provides sufficient details regarding the model architecture, training criteria, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which others can replicate the results. Including such resources would enhance the paper's impact and facilitate further research.
While the model shows strong performance, it is primarily focused on monaural two-speaker separation, which may limit its applicability in more complex scenarios involving multiple speakers or varying acoustic conditions. Additionally, the reliance on specific datasets for validation raises questions about generalization to real-world applications where conditions may differ significantly from those in the training data.
The advancements presented in this paper have the potential to significantly improve speech separation technologies, which are crucial for applications in automatic speech recognition, hearing aids, and communication systems in noisy environments. By effectively addressing the cocktail party problem, the APSS model could enhance user experience in various audio processing applications, making it a valuable contribution to the field.
This paper introduces a multi-stage self-guided framework designed to address the spatial semantic segmentation of sound scene (S5) task in the DCASE 2025 Task 4 challenge. This framework integrates models focused on three distinct tasks: Universal Sound Separation (USS), Single-label Classification (SC), and Target Sound Extraction (TSE). Initially, USS breaks down a complex audio mixture into separate source waveforms. Each of these separated waveforms is then processed by an SC block, generating two critical pieces of information: the waveform itself and its corresponding class label. These serve as inputs for the TSE stage, which isolates the source that matches this information. Since these inputs are produced within the system, the extraction target is identified autonomously, removing the necessity for external guidance. The extracted waveform can be looped back into the classification task, creating a cycle of iterative refinement that progressively enhances both separability and labeling accuracy. We thus call our framework a multi-stage self-guided system due to these self-contained characteristics. On the official evaluation dataset, the proposed system achieves a class-aware signal-to-distortion ratio improvement (CA-SDRi) of 11.00 dB and 55.8% accuracy in label prediction, outperforming the ResUNetK baseline by 4.4 dB and 4.3%, respectively, and achieving first place among all submissions.
Primary: Fictional University
All Institutions: School of Electrical Engineering, University Imagination, Important Laboratory, Fictional University, Meta Reality Labs
This paper presents a novel self-guided multi-stage framework for sound scene analysis that significantly improves audio source separation and classification. The technical contributions are substantial, with a well-defined methodology and promising experimental results, although the lack of reproducibility resources and broader comparative analyses could be areas for improvement.
The paper proposes a multi-stage self-guided framework that integrates Universal Sound Separation (USS), Single-label Classification (SC), and Target Sound Extraction (TSE) in a novel way. The architecture is well-structured and leverages iterative refinement to enhance both separation and classification accuracy. The use of a modified DeFT-Mamba model for USS and TSE is innovative, as it allows for the simultaneous processing of audio and class labels, which is a significant improvement over traditional methods that rely on external cues. The methodology is robust, with clear delineation of stages and the rationale for each component's design.
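The self-guided loop can be summarized in a few lines of pseudocode-style Python; `uss`, `classifier`, and `tse` are placeholders for the three trained models, and the refinement depth is an assumption.

```python
def self_guided_separation(mixture, uss, classifier, tse, n_refine=2):
    """Sketch of the self-guided pipeline: separate, label, then re-extract
    conditioned on the system's own outputs (no external guidance needed)."""
    sources = uss(mixture)                         # 1) universal sound separation
    results = []
    for waveform in sources:
        label = classifier(waveform)               # 2) single-label classification
        for _ in range(n_refine):                  # 3) iterative target sound extraction
            waveform = tse(mixture, waveform, label)   # conditioned on own waveform + label
            label = classifier(waveform)           # refined estimate can update the label
        results.append((waveform, label))
    return results
```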
The experimental results demonstrate a significant improvement in both CA-SDRi and classification accuracy compared to the baseline. Achieving first place in the DCASE 2025 Task 4 challenge indicates a strong validation of the proposed framework. The paper provides comprehensive details on the training setup, data augmentation strategies, and evaluation metrics, which are crucial for assessing the performance of the models. However, the absence of comparisons with a wider range of existing methods could limit the contextual understanding of the results.
The paper includes detailed descriptions of the model architectures, loss functions, and training procedures, which are essential for reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the findings fully. Providing a GitHub repository or similar resource would greatly enhance reproducibility.
One limitation is the reliance on a specific dataset (DCASE 2025 Task 4), which may not generalize to other audio separation tasks or real-world applications. Additionally, while the iterative refinement process is beneficial, it may introduce computational overhead, making the framework less practical for real-time applications. The paper does not address potential issues related to model complexity and inference time.
The proposed framework has significant implications for audio processing applications, particularly in environments where sound source separation and classification are critical, such as in robotics, surveillance, and assistive technologies. By improving the accuracy of sound event detection, this research could enhance user experiences in various audio-related fields, including augmented reality and smart home devices.
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities. Audio samples are available at https://justinlovelace.github.io/projects/speechop
Primary: Adobe (work done during an internship)
All Institutions: Adobe
The main contribution of this paper is the introduction of SpeechOp, a novel framework for generative speech processing that enables inference-time task composition, significantly improving the quality and versatility of speech-to-speech processing tasks. The technical contributions, particularly in task composition and leveraging pre-trained models, represent a meaningful advancement in the field of audio machine learning.
The methodology presented in SpeechOp is innovative, leveraging a multi-task latent diffusion model that repurposes pre-trained TTS models for a variety of speech processing tasks. The introduction of Implicit Task Composition (ITC) is particularly noteworthy, as it allows for dynamic task composition at inference time, which is a significant advancement in the field. The integration of ASR-derived transcripts to guide the generative process adds a layer of sophistication that enhances the model's ability to preserve content and speaker identity, addressing a critical challenge in speech-to-speech processing.
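A minimal sketch of the ITC idea follows: a Whisper transcript of the noisy input conditions the enhancement step. The Whisper calls are real openai-whisper APIs, while `speechop_enhance` is a hypothetical handle standing in for the SpeechOp model itself.

```python
import whisper  # openai-whisper package

def itc_enhance(noisy_wav_path, speechop_enhance):
    """Implicit Task Composition sketch: an ASR transcript steers generative enhancement.
    `speechop_enhance(path, transcript)` is a hypothetical stand-in for the SpeechOp model;
    only the Whisper calls below are real APIs."""
    asr = whisper.load_model("base")
    transcript = asr.transcribe(noisy_wav_path)["text"]    # web-scale speech understanding
    return speechop_enhance(noisy_wav_path, transcript)    # enhancement conditioned on the text
```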
The experimental setup is robust, with a clear focus on evaluating the performance of SpeechOp across multiple speech tasks. The paper provides comparative results against state-of-the-art methods, demonstrating significant improvements in content preservation and task quality. However, the specifics of the datasets used and the metrics for evaluation could be elaborated further to strengthen the findings.
The paper mentions audio samples and provides a demo URL, which is a positive aspect for reproducibility. However, there is a lack of detailed information regarding the implementation, such as code availability or specific hyperparameters used in training, which could hinder full reproducibility by other researchers.
One limitation is the reliance on pre-trained TTS models, which may introduce biases inherent in those models. Additionally, while the paper claims state-of-the-art performance, it would benefit from a more extensive discussion on the generalizability of the approach across diverse speech datasets and languages.
The implications of SpeechOp are significant, as it offers a versatile framework for various speech processing tasks, potentially transforming applications in accessibility, voice synthesis, and real-time communication. The ability to compose tasks at inference time could lead to more adaptive and intelligent speech systems, enhancing user experience in numerous domains.
High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, limiting their edge-device application. To address this, we propose the Lightweight Implicit Neural Network (LINN), a novel two-stage framework. LINN first generates initial estimates using a time-domain warping, which is then refined by an Implicit Binaural Corrector (IBC) module. IBC is an implicit neural network that predicts amplitude and phase corrections directly, resulting in a highly compact model architecture. Experimental results show that LINN achieves statistically comparable perceptual quality to the best-performing baseline model while significantly improving computational efficiency. Compared to the most efficient existing method, LINN achieves a 72.7% reduction in parameters and significantly fewer compute operations (MACs). This demonstrates that our approach effectively addresses the trade-off between synthesis quality and computational efficiency, providing a new solution for high-fidelity edge-device spatial audio applications.
Primary: East China Normal University
All Institutions: Shanghai Institute of Artificial Intelligence for Education, East China Normal University, School of Computer Science and Technology
The main contribution of this paper is the introduction of LINN, a lightweight framework for binaural audio synthesis that effectively balances high perceptual quality with computational efficiency, making it suitable for edge-device applications. This work represents a meaningful step forward in the field of audio synthesis, particularly in the context of deep learning and implicit neural representations.
The proposed Lightweight Implicit Neural Network (LINN) introduces a two-stage framework that effectively combines a Time-Domain Warping (TDW) module with an Implicit Binaural Corrector (IBC). The IBC's innovative approach of modeling spectral corrections as a continuous function is a significant advancement in binaural audio synthesis. The use of implicit neural representations allows for a compact model architecture, which is particularly beneficial for edge-device applications. The methodology is well-structured, with clear explanations of the architecture, loss functions, and positional encoding strategies, making it a robust contribution to the field.
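To illustrate what an implicit corrector of this kind typically looks like, the sketch below maps sinusoidally encoded (time, frequency) coordinates plus a conditioning vector to an amplitude and a phase correction. Dimensions, band counts, and the conditioning pathway are assumptions rather than the IBC's actual configuration.

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Sinusoidal encoding of (time, frequency) coordinates, typical for implicit networks."""
    def __init__(self, n_bands: int = 8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_bands) * torch.pi)

    def forward(self, coords):                    # coords: (..., 2)
        proj = coords.unsqueeze(-1) * self.freqs  # (..., 2, n_bands)
        return torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(-2)

class ImplicitCorrector(nn.Module):
    """Tiny MLP mapping encoded coordinates plus a conditioning vector to
    an amplitude correction and a phase correction for the warped estimate."""
    def __init__(self, cond_dim: int = 32, hidden: int = 128, n_bands: int = 8):
        super().__init__()
        self.enc = FourierFeatures(n_bands)
        self.mlp = nn.Sequential(
            nn.Linear(4 * n_bands + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # [amplitude correction, phase correction]
        )

    def forward(self, coords, cond):
        return self.mlp(torch.cat([self.enc(coords), cond], dim=-1))
```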
The paper presents a thorough experimental evaluation using the Binaural Speech dataset, comparing LINN against several state-of-the-art models. The results indicate that while LINN does not outperform all baselines in every metric, it achieves competitive performance with significantly lower computational requirements. The combination of quantitative metrics and perceptual evaluations (MOS tests) provides a comprehensive assessment of the model's effectiveness. The statistical analysis further strengthens the findings, showcasing LINN's ability to maintain quality while reducing complexity.
The implementation details are well-documented, including the architecture specifications, training procedures, and evaluation metrics. The authors provide a GitHub repository for source code and audio samples, which enhances the reproducibility of the results. However, the paper could benefit from more extensive documentation on the dataset preprocessing and specific hyperparameter choices to facilitate easier replication of the experiments.
One limitation is that LINN does not achieve the highest performance in all objective metrics when compared to some baseline models, which may raise questions about its applicability in scenarios where absolute performance is critical. Additionally, the reliance on a specific dataset may limit the generalizability of the results to other audio synthesis tasks or datasets.
The development of LINN has significant implications for the deployment of binaural audio synthesis in resource-constrained environments, such as mobile devices and IoT applications. By addressing the trade-off between computational efficiency and audio quality, this work opens avenues for more widespread use of spatial audio technologies in virtual reality, gaming, and immersive media experiences.
Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines remains unresolved. This paper introduces a novel training technique called one-stage supervised contrastive learning (OS-SCL), which significantly addresses this problem by perturbing features in the embedding space and employing a one-stage noisy supervised contrastive learning approach. On the DCASE 2020 Challenge Task 2, it achieved 94.64% AUC, 88.42% pAUC, and 89.24% mAUC using only Log-Mel features. Additionally, a time-frequency feature named TFgram is proposed, which is extracted from raw audio. This feature effectively captures critical information for anomalous sound detection, ultimately achieving 95.71% AUC, 90.23% pAUC, and 91.23% mAUC. The source code is available at: www.github.com/huangswt/OS-SCL.
The paper presents a novel approach to anomalous sound detection that combines one-stage supervised contrastive learning with feature perturbation techniques, achieving state-of-the-art performance while challenging traditional beliefs about feature importance in audio analysis. The methodology is innovative, and the results have significant implications for real-world applications in industrial settings.
The proposed methodology introduces a novel one-stage supervised contrastive learning (OS-SCL) approach that effectively reduces false alarms in anomalous sound detection by perturbing features in the embedding space. The integration of a feature perturbation head (FPH) and the use of MixUp for data augmentation are innovative strategies that enhance the model's ability to learn decision boundaries. The introduction of TFgram as a time-frequency feature extraction method is also a significant contribution, as it challenges the conventional belief regarding the necessity of high-frequency components in anomaly detection. The methodology is well-structured, with clear explanations of the components and their roles in the overall framework.
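For reference, a plain supervised contrastive loss (the building block OS-SCL starts from) can be written as below; the feature-perturbation head, MixUp, and noisy-label handling that distinguish OS-SCL are deliberately omitted.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over L2-normalized embeddings of shape (batch, dim)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / tau                                    # pairwise cosine similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float("-inf"))        # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    pos_counts = positives.sum(1).clamp(min=1)
    return -(log_prob * positives).sum(1).div(pos_counts).mean()
```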
The experiments are robust, utilizing the DCASE 2020 Challenge Task 2 dataset, which is a well-regarded benchmark in the field. The reported results demonstrate strong performance metrics (AUC, pAUC, mAUC) that surpass existing methods, indicating the effectiveness of the proposed approach. The paper includes a thorough comparison with state-of-the-art methods, and the ablation studies provide valuable insights into the contributions of each component of the proposed framework. However, more detailed statistical analyses and comparisons with additional datasets could further strengthen the findings.
The paper provides sufficient implementation details, including model architecture, training parameters, and the dataset used. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate the results. However, the paper could benefit from a more comprehensive description of the experimental setup and hyperparameter tuning processes to facilitate easier reproduction of the results.
One limitation of the study is the reliance on a specific dataset (DCASE 2020 Challenge Task 2), which may limit the generalizability of the results to other domains or types of anomalous sounds. Additionally, while the OS-SCL method shows promise, the impact of label noise introduced during training could be further explored, as it may lead to unintended consequences in certain scenarios.
The proposed method has significant implications for industrial applications where anomalous sound detection is critical for maintenance and operational efficiency. By reducing false alarms and improving detection stability, the approach can enhance the reliability of automated monitoring systems in various industries, potentially leading to cost savings and improved safety. The findings challenge existing paradigms regarding feature reliance in audio processing, paving the way for further research in this area.
Self-supervised learning (SSL) has pushed speaker verification accuracy close to state-of-the-art levels, but the Transformer backbones used in most SSL encoders hinder on-device and real-time deployment. Prior compression work trims layer depth or width yet still inherits the quadratic cost of self-attention. We propose SV-Mixer, the first fully MLP-based student encoder for SSL distillation. SV-Mixer replaces Transformer with three lightweight modules: Multi-Scale Mixing for multi-resolution temporal features, Local-Global Mixing for frame-to-utterance context, and Group Channel Mixing for spectral subspaces. Distilled from WavLM, SV-Mixer outperforms a Transformer student by 14.6% while cutting parameters and GMACs by over half, and at 75% compression, it closely matches the teacher's performance. Our results show that attention-free SSL students can deliver teacher-level accuracy with hardware-friendly footprints, opening the door to robust on-device speaker verification.
This paper presents a significant advancement in the field of speaker verification by introducing SV-Mixer, a lightweight, attention-free encoder that achieves competitive performance while being suitable for deployment in resource-constrained environments. The innovative approach and thorough experimental validation position this work as a valuable contribution to the ongoing evolution of self-supervised learning in audio applications.
The paper introduces SV-Mixer, a novel architecture that replaces Transformer encoders with a fully MLP-based design tailored for self-supervised learning in speaker verification. The methodology is well-structured, incorporating three specialized mixing modules (Multi-Scale Mixing, Local-Global Mixing, and Group Channel Mixing) that enhance temporal and spectral feature extraction while reducing computational complexity. The design is justified through a clear rationale for moving away from self-attention mechanisms, which are computationally expensive. The paper effectively demonstrates how these modules work together to maintain accuracy under aggressive model compression, showcasing a thoughtful approach to architectural design in the context of SSL.
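The general flavour of such attention-free mixing is captured by a standard MLP-Mixer block, sketched below in PyTorch; the actual Multi-Scale, Local-Global, and Group Channel Mixing modules are more elaborate, so this is only a reference point, not the SV-Mixer architecture.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Generic attention-free mixer block: frame (token) mixing followed by channel mixing."""
    def __init__(self, n_frames: int, n_channels: int, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_channels)
        self.frame_mix = nn.Sequential(
            nn.Linear(n_frames, n_frames * expansion), nn.GELU(),
            nn.Linear(n_frames * expansion, n_frames),
        )
        self.norm2 = nn.LayerNorm(n_channels)
        self.channel_mix = nn.Sequential(
            nn.Linear(n_channels, n_channels * expansion), nn.GELU(),
            nn.Linear(n_channels * expansion, n_channels),
        )

    def forward(self, x):                       # x: (batch, frames, channels)
        # mix across frames (operate on the transposed tensor), then across channels
        x = x + self.frame_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mix(self.norm2(x))
```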
The experiments are comprehensive, utilizing the VoxCeleb2 dataset for training and evaluating on multiple test sets, including VoxCeleb1 and VoxSRC 2023. The results consistently show that SV-Mixer outperforms both Transformer-based models and other MLP architectures, providing strong empirical evidence for its effectiveness. The ablation studies further substantiate the contributions of each mixing module, and the robustness of SV-Mixer under varying compression levels is particularly noteworthy. However, the paper could benefit from additional comparisons with more diverse architectures and a broader range of datasets.
The authors have made their code, pretrained models, and inference scripts publicly available on GitHub, which is a positive aspect for reproducibility. The implementation details are clearly outlined, including training parameters and data augmentation strategies. However, the paper could improve by providing more detailed instructions on how to replicate the experiments, such as specific configurations for the training environment.
The paper acknowledges limitations, such as the fixed training setup and reliance on a single distillation strategy, which may overlook more effective combinations of training objectives. Additionally, while the results are promising, the generalizability of the findings to other domains or tasks beyond speaker verification remains to be explored.
The proposed SV-Mixer architecture has significant implications for on-device speaker verification, particularly in resource-constrained environments. By reducing the computational burden associated with traditional Transformer architectures, this work could facilitate the deployment of advanced speaker verification systems in mobile and embedded applications, enhancing accessibility and usability in real-world scenarios. The findings may also inspire further research into attention-free architectures across various domains in machine learning.
Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in training. To address this, we propose a mixture-of-LoRA-experts approach that integrates multiple low-rank adapters (LoRA) into the model's attention layers. A routing mechanism selectively activates specialized experts, enhancing adaptability to evolving deepfake attacks. Experimental results show that our method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios, reducing equal error rates relative to baseline models. Notably, our best MoE-LoRA model lowers the average out-of-domain EER from 8.55% to 6.08%, demonstrating its effectiveness in achieving generalizable audio deepfake detection.
The main contribution of this paper is the introduction of a mixture-of-LoRA-experts framework that significantly improves the generalization capabilities of audio deepfake detection systems. This work represents a meaningful advancement in the field, combining innovative methodologies with a rigorous experimental evaluation, although it would benefit from improved reproducibility and a deeper exploration of limitations.
The proposed methodology introduces a mixture-of-LoRA-experts (MoE-LoRA) framework that enhances the adaptability of audio deepfake detection models by integrating multiple low-rank adapters into the attention layers of the Wav2Vec2 model. This approach is innovative in that it combines the benefits of parameter-efficient fine-tuning with the dynamic selection of specialized experts, allowing for improved generalization to unseen deepfake attacks. The routing mechanism for expert selection is well-conceived, promoting flexibility in model behavior depending on input characteristics. However, the paper could benefit from a more detailed explanation of the routing mechanism and its implications on computational efficiency.
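A compact sketch of a router-gated mixture of LoRA adapters wrapped around a frozen linear projection is given below; the expert count, rank, and dense softmax routing are assumptions chosen for clarity rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    """Frozen base projection plus a router-weighted mixture of low-rank adapters."""
    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # only adapters and router are trained
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)
        self.scale = alpha / rank

    def forward(self, x):                                 # x: (batch, seq, d_in)
        gate = F.softmax(self.router(x), dim=-1)          # (batch, seq, n_experts)
        low_rank = torch.einsum("bsd,edr,ero->bseo", x, self.A, self.B)  # per-expert update
        delta = (gate.unsqueeze(-1) * low_rank).sum(dim=2)               # router-weighted sum
        return self.base(x) + self.scale * delta
```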
The experimental setup is robust, utilizing multiple datasets that cover a range of spoofing techniques, which is crucial for evaluating the generalizability of the proposed method. The results demonstrate a clear improvement in equal error rates (EER) compared to baseline models, particularly in out-of-domain scenarios, which is a significant contribution to the field. The ablation studies conducted further strengthen the findings by highlighting the contributions of individual components of the MoE-LoRA framework. However, the paper lacks a discussion on the statistical significance of the results, which would enhance the credibility of the claims made.
The paper provides a thorough description of the experimental setup, including dataset details, training protocols, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should consider releasing the code to facilitate validation and further exploration by the research community.
One limitation is the reliance on the specific architecture of Wav2Vec2 and AASIST, which may not generalize well to other model architectures or domains outside audio deepfake detection. Additionally, while the MoE-LoRA approach shows promise, the complexity of the model may introduce challenges in deployment and real-time applications. The paper also does not address potential biases in the training datasets, which could affect the model's performance in real-world scenarios.
The implications of this research are significant, as audio deepfake detection is increasingly relevant in various domains, including security, media integrity, and misinformation prevention. The proposed method could enhance the robustness of voice authentication systems and contribute to the development of more reliable detection tools in the face of evolving deepfake technologies. The adaptability of the MoE-LoRA framework may also inspire similar approaches in other domains of machine learning where generalization to unseen data is critical.
Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reconstruction at competitive low bitrates while demonstrating an inherent ability for information disentanglement. We construct a two-stage language model for text-to-speech (TTS) synthesis using this codec, which, despite its lightweight design and minimal data requirements, achieves a state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to several larger models. Furthermore, the codec's design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, Hong Kong SAR
This paper presents a novel low-bitrate multi-stream residual codec that effectively disentangles speech attributes for high-fidelity speech generation. The technical contributions are significant, with a well-structured methodology and robust experimental validation, positioning it as a valuable advancement in the field of audio processing and speech synthesis.
The proposed method introduces a multi-stream residual codec that effectively disentangles speech into semantic, timbre, prosody, and residual streams. This architecture is innovative in its approach to information disentanglement, achieving high compression rates while maintaining speech quality. The use of pre-trained models for feature extraction and the cascaded architecture for stream fusion are well-justified and contribute to the overall efficiency of the codec. The methodology is clearly articulated, with a logical flow from speech encoding to reconstruction, and the integration of auxiliary losses enhances the model's ability to capture prosodic features.
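The cascaded, residual character of the codec can be illustrated with a toy multi-stream quantizer in which each stream encodes what earlier streams left unexplained; the shared codebook and single latent here are simplifications, since the real codec uses dedicated semantic, timbre, prosody, and residual branches.

```python
import torch
import torch.nn as nn

class MultiStreamResidualQuantizer(nn.Module):
    """Toy cascaded quantizer: each stream encodes what the previous streams left over."""
    def __init__(self, dim: int = 256, codebook_size: int = 1024, n_streams: int = 4):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(n_streams)]
        )

    def forward(self, latent):                            # latent: (batch, frames, dim)
        residual, quantized, codes = latent, torch.zeros_like(latent), []
        for codebook in self.codebooks:                   # e.g. semantic -> timbre -> prosody -> residual
            table = codebook.weight.unsqueeze(0).expand(latent.size(0), -1, -1)
            idx = torch.cdist(residual, table).argmin(dim=-1)   # nearest code per frame
            q = codebook(idx)
            quantized = quantized + q
            residual = residual - q                       # next stream models the leftover detail
            codes.append(idx)
        return quantized, codes
```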
The experiments are comprehensive, utilizing diverse datasets that ensure robustness in speaker identity and prosody representation. The results demonstrate the codec's competitive performance against existing models, achieving state-of-the-art WER and speaker similarity metrics. The inclusion of various evaluation metrics, such as STOI, PESQ, and WER, provides a well-rounded assessment of the model's capabilities. However, the paper could benefit from more extensive comparisons with a wider range of existing codecs to further validate its claims.
The implementation details are described in sufficient depth, including the architecture of the codec and TTS model, as well as the training objectives and evaluation metrics. However, the lack of URLs for code repositories or demo pages limits the reproducibility of the work. Providing access to the code and trained models would significantly enhance the ability of other researchers to replicate the results.
One limitation is the reliance on pre-trained models, which may not generalize well across different languages or dialects. Additionally, the evaluation is primarily focused on English datasets, potentially limiting the applicability of the findings to other languages. The paper also does not address the scalability of the model to larger datasets or more complex speech scenarios.
The proposed codec has significant implications for applications in speech synthesis, voice conversion, and other audio processing tasks. Its ability to disentangle speech attributes could lead to advancements in personalized TTS systems and more efficient audio streaming technologies. The lightweight design and low bitrate requirements make it particularly relevant for mobile and real-time applications, potentially broadening access to high-quality speech generation technologies.
Speech therapy plays a critical role in treating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback, constraining their practical utility. Recent advances in multimodal large language models (MLLMs) have demonstrated significant potential in healthcare, particularly through their ability to integrate multimodal data for adaptive assessment and therapeutic feedback. Nevertheless, challenges including insufficient acquisition and fusion of articulatory information, inadequate parsing of articulatory organ motion trajectories, and the scarcity of high-quality domain-specific datasets hinder the application of MLLMs in speech therapy. To address these limitations, we propose an MLLM-based speech rehabilitation assistance system that synergistically leverages ultrasound tongue imaging and speech signals to deliver precise, interactive articulatory feedback. We construct a high-quality domain-specific dataset comprising UTI-speech dialogue pairs. This dataset facilitates fine-tuning to enhance the model's clinical adaptability. Building on this dataset, our method adopts a spatiotemporal fusion training strategy for ultrasound videos and speech signals, enabling fine-grained articulatory impairment analysis and ultimately generating actionable feedback.
Primary: The Eighth Affiliated Hospital of Sun Yat-sen University
All Institutions: Chinese Academy of Sciences, The Eighth Affiliated Hospital of Sun Yat-sen University, Key Laboratory of Biomedical Imaging Science and System, Department of Rehabilitation Medicine, Shenzhen Institute of Advanced Technology
The main contribution of this paper is the development of a multimodal large language model-based system for personalized speech therapy that integrates ultrasound imaging and speech signals, demonstrating significant potential to enhance the effectiveness of speech rehabilitation. The innovative methodology and promising experimental results position this work as a significant advancement in the intersection of machine learning and healthcare.
The proposed methodology is innovative, leveraging a multimodal large language model (MLLM) that integrates ultrasound tongue imaging (UTI) with speech signals to provide personalized feedback for speech rehabilitation. The authors construct a high-quality dataset of UTI-speech dialogue pairs, which is critical for fine-tuning the model. The dual-agent collaborative QA generation framework is a notable contribution, as it enhances the generation of dialogue data for therapy applications. The spatiotemporal fusion training strategy is well-conceived, allowing for a nuanced understanding of articulatory dynamics. However, the paper could benefit from a more detailed explanation of the model architecture and the specific algorithms used for data processing and feature extraction.
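To make the spatiotemporal fusion idea concrete, the sketch below shows one plausible way to combine ultrasound-video features with speech features via cross-attention before passing them to an MLLM. All module names, dimensions, and the choice of cross-attention are illustrative assumptions; the paper's actual fusion operator is not reproduced here.

```python
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    """Illustrative fusion of ultrasound-video and speech features.

    Module names and dimensions are assumptions for illustration only;
    they are not the paper's actual architecture.
    """
    def __init__(self, video_dim=512, audio_dim=768, fused_dim=1024, num_heads=8):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, fused_dim)
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        # Cross-attention: speech frames attend to temporally aligned video frames.
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, Tv, video_dim), audio_feats: (B, Ta, audio_dim)
        v = self.video_proj(video_feats)
        a = self.audio_proj(audio_feats)
        fused, _ = self.cross_attn(query=a, key=v, value=v)
        # Residual connection keeps the acoustic stream intact.
        return self.norm(a + fused)  # (B, Ta, fused_dim) tokens for the MLLM
```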
The experiments are comprehensive, utilizing a well-defined dataset and clear evaluation metrics, including BLEU, METEOR, and ROUGE-L for natural language generation, as well as accuracy and F1-score for dysarthria assessment. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. The ablation studies provide valuable insights into the contributions of different modalities, confirming the importance of integrating UTI data. However, the paper lacks a comparative analysis with more recent state-of-the-art models in the same domain, which could strengthen the claims of superiority.
The implementation details are adequately described, including the training configuration and dataset characteristics. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the dataset and model would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a relatively small dataset, which may affect the generalizability of the model. Additionally, the paper does not address potential biases in the dataset or the implications of using a single model architecture. The authors could also explore the scalability of their approach in real-world clinical settings.
This research has significant implications for the field of speech therapy, particularly in enhancing accessibility and personalization of treatment for individuals with speech disorders. By integrating advanced machine learning techniques with clinical practice, the proposed system could improve patient outcomes and reduce the burden on healthcare professionals. The work may inspire further research into multimodal approaches in other areas of rehabilitation and therapy. The main contribution of this paper is the development of a multimodal large language model-based system for personalized speech therapy that integrates ultrasound imaging and speech signals, demonstrating significant potential to enhance the effectiveness of speech rehabilitation. The innovative methodology and promising experimental results position this work as a significant advancement in the intersection of machine learning and healthcare.
Dysarthric speech severity classification is crucial for objective clinical assessment and progress monitoring in individuals with motor speech disorders. Although prior methods have addressed this task, achieving robust generalization in speaker-independent (SID) scenarios remains challenging. This work introduces DSSCNet, a novel deep neural architecture that combines convolutional, squeeze-and-excitation (SE), and residual network components to extract discriminative representations of dysarthric speech from mel spectrograms. The SE block selectively emphasizes the important features of the dysarthric speech, thereby minimizing loss and enhancing overall model performance. We also propose a cross-corpus fine-tuning framework for severity classification, adapted from detection-based transfer learning approaches. DSSCNet is evaluated on two benchmark dysarthric speech corpora, TORGO and UA-Speech, under two speaker-independent evaluation protocols: One-Speaker-Per-Severity (OSPS) and Leave-One-Speaker-Out (LOSO). DSSCNet achieves accuracies of 56.84% and 62.62% under OSPS and 63.47% and 64.18% under LOSO on TORGO and UA-Speech, respectively, outperforming existing state-of-the-art methods. Upon fine-tuning, performance improves substantially, with DSSCNet achieving up to 75.80% accuracy on TORGO and 68.25% on UA-Speech in OSPS, and up to 77.76% and 79.44%, respectively, in LOSO. These results demonstrate the effectiveness and generalizability of DSSCNet for fine-grained severity classification across diverse dysarthric speech datasets.
Primary: Sikkim Manipal Institute of Technology (SMIT)
All Institutions: Sikkim Manipal Institute of Technology (SMIT), National Institute of Technology Sikkim
The main contribution of this paper is the development of DSSCNet, a novel deep learning architecture that significantly improves dysarthric speech severity classification through innovative use of transfer learning and advanced neural network components. This work represents a meaningful step forward in the field of speech processing, addressing critical challenges in speaker-independent scenarios and enhancing the potential for real-world applications.
The proposed DSSCNet architecture effectively integrates convolutional layers, squeeze-excitation blocks, and residual connections to enhance the classification of dysarthric speech severity. The methodology is well-structured, leveraging deep learning principles to address the challenges of speaker-independent classification. The incorporation of a cross-corpus fine-tuning strategy is particularly noteworthy, as it allows the model to generalize better across different datasets, which is a significant advancement over traditional methods that often struggle with speaker variability.
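For readers unfamiliar with squeeze-and-excitation, the following minimal PyTorch sketch shows how an SE block can reweight channels inside a convolutional residual block of the kind DSSCNet describes; the layer sizes and block layout are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation over the channel dimension of 2D feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels of the mel-spectrogram feature maps

class SEResidualBlock(nn.Module):
    """Conv -> SE -> residual; sizes are illustrative, not DSSCNet's exact configuration."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return torch.relu(x + self.se(self.conv(x)))
```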
The experimental setup is robust, utilizing two benchmark datasets (TORGO and UA-Speech) and employing rigorous evaluation protocols (OSPS and LOSO) to assess model performance. The results demonstrate a clear improvement over baseline models and existing state-of-the-art methods, particularly after fine-tuning, which showcases the effectiveness of the proposed architecture. However, the paper could benefit from more extensive ablation studies to further validate the contributions of individual components.
The paper provides a detailed description of the methodology, including data preprocessing, model architecture, and training procedures, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to replicate the results directly.
While the model shows promising results, it still faces challenges related to class imbalance, particularly for medium severity levels. The reliance on two specific datasets may also limit the generalizability of the findings to broader dysarthric speech scenarios. Additionally, the paper does not address potential overfitting issues, which could arise from the model's complexity.
The implications of this research are significant for clinical applications, particularly in developing assistive technologies for individuals with dysarthria. By improving the accuracy of severity classification, the proposed model can enhance the effectiveness of therapy plans and assistive communication devices, ultimately contributing to better quality of life for affected individuals. The approach also lays the groundwork for future research in speaker-independent speech processing tasks. The main contribution of this paper is the development of DSSCNet, a novel deep learning architecture that significantly improves dysarthric speech severity classification through innovative use of transfer learning and advanced neural network components. This work represents a meaningful step forward in the field of speech processing, addressing critical challenges in speaker-independent scenarios and enhancing the potential for real-world applications.
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuses speaker-aware global information and fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To the best of our knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and training dataset can be found at https://github.com/NKU-HLT/GLAD.
Primary: Nankai University
All Institutions: Nankai University, College of Computer Science
The paper presents GLAD, a novel framework for multi-talker ASR that significantly enhances transcription accuracy in overlapping speech scenarios. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on the field of speech processing and related applications.
The paper proposes a novel framework called GLAD, which utilizes a Mixture-of-Experts (MoE) approach to improve multi-talker automatic speech recognition (MTASR). The methodology is well-structured, introducing a global-local aware dynamic routing mechanism that effectively combines global context with local acoustic features. This dual approach allows for more precise expert selection, particularly in high-overlap scenarios where speaker identities and content are entangled. The integration of LoRA-based experts enhances the scalability and efficiency of the model, making it suitable for real-world applications. The design choices are justified, and the paper provides a clear explanation of how the proposed framework addresses existing limitations in MTASR.
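A minimal sketch of the global-local routing idea is given below: utterance-pooled (global) features are concatenated with frame-level (local) features to gate a set of LoRA-style experts. The expert count, adapter rank, and mean-pooling choice are illustrative assumptions rather than GLAD's actual configuration.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank adapter acting as one expert (rank is an assumption)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.up(self.down(x))

class GlobalLocalRouter(nn.Module):
    """Gate experts from concatenated global (utterance-pooled) and local (frame) features."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(2 * dim, num_experts)
        self.experts = nn.ModuleList([LoRAExpert(dim) for _ in range(num_experts)])

    def forward(self, frames):  # frames: (B, T, dim)
        global_ctx = frames.mean(dim=1, keepdim=True).expand_as(frames)
        weights = torch.softmax(self.gate(torch.cat([frames, global_ctx], dim=-1)), dim=-1)
        expert_out = torch.stack([e(frames) for e in self.experts], dim=-1)  # (B, T, dim, E)
        return frames + (expert_out * weights.unsqueeze(2)).sum(dim=-1)
```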
The experiments conducted on the LibriSpeechMix dataset are comprehensive, comparing GLAD-SOT against various baseline models. The results demonstrate significant improvements in performance, especially in challenging multi-talker scenarios. The use of both Permutation-Invariant WER and Overlap-Aware WER as evaluation metrics provides a nuanced understanding of the model's capabilities. The ablation studies effectively highlight the contributions of different components of the GLAD architecture, reinforcing the importance of the proposed global-local fusion strategy.
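For context on the evaluation protocol, the snippet below computes a permutation-invariant WER by searching over speaker assignments. This is one common formulation; the paper's exact aggregation (e.g., pooling errors and words across speakers rather than averaging per-speaker WER) may differ.

```python
import itertools

def wer(ref, hyp):
    """Word error rate via edit distance (substitutions + insertions + deletions) / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def permutation_invariant_wer(refs, hyps):
    """Best speaker assignment between reference and hypothesis transcripts."""
    assert len(refs) == len(hyps)
    return min(
        sum(wer(r, h) for r, h in zip(refs, perm)) / len(refs)
        for perm in itertools.permutations(hyps)
    )
```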
The authors provide a link to their GitHub repository, which includes the code and training dataset, enhancing the reproducibility of their work. The detailed descriptions of the model architecture, training settings, and evaluation metrics further support the ability of other researchers to replicate the study. However, the paper could benefit from additional details regarding hyperparameter tuning and specific configurations used during training.
While the proposed GLAD framework shows promising results, it may still face challenges in extremely noisy environments or with highly variable speaker characteristics that were not extensively tested. Additionally, the reliance on the LibriSpeechMix dataset may limit the generalizability of the findings to other real-world datasets with different characteristics. The paper does not address potential computational costs associated with the dynamic routing mechanism, which could be a concern for deployment in resource-constrained environments.
The advancements in MTASR presented in this paper have significant implications for various applications, including meeting transcription, voice assistants, and multi-party dialogue systems. By improving the accuracy of speech recognition in overlapping scenarios, this research could enhance communication technologies and accessibility tools, benefiting diverse user groups. The dynamic routing approach could also inspire further research in other domains where multi-modal data processing is required. The paper presents GLAD, a novel framework for multi-talker ASR that significantly enhances transcription accuracy in overlapping speech scenarios. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on the field of speech processing and related applications.
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs' audio understanding performance, researchers have proposed various benchmarks. However, key aspects of real-world interaction remain underexplored in existing benchmarks: audio signals typically contain both speech and non-speech components, and the energy levels of these components can vary significantly across scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in the joint understanding setting. To address this issue, we introduce a Chain-of-Thought approach, which effectively improves LALMs' joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.
Primary: School of Electrical Engineering
All Institutions: School of Electrical Engineering (Republic of Korea)
The paper presents SSEU-Bench, a novel benchmark for evaluating Large Audio Language Models (LALMs) in joint audio understanding tasks, significantly advancing the field of audio processing by addressing key gaps in existing benchmarks. The innovative methodology and comprehensive evaluation highlight the potential of LALMs while revealing challenges that remain in achieving robust audio understanding.
The paper introduces SSEU-Bench, a novel benchmark that addresses the joint understanding of speech, scene, and events in audio signals, which is a significant advancement over existing benchmarks that typically treat these components separately. The methodology effectively incorporates energy differences between speech and non-speech audio, which is crucial for realistic audio understanding tasks. The introduction of the Chain-of-Thought (CoT) approach for improving joint understanding by decomposing tasks into simpler steps is innovative and adds depth to the methodology.
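The Chain-of-Thought decomposition can be illustrated with a short sketch in which a generic LALM callable is queried step by step; the prompts and the three-stage ordering are assumptions for illustration, not the benchmark's actual wording.

```python
def joint_understanding_with_cot(lalm, audio):
    """Decompose joint speech/scene/event understanding into sequential prompts.

    `lalm` is any callable (prompt, audio) -> str; the exact prompts used by
    SSEU-Bench are not reproduced here -- this only illustrates the decomposition.
    """
    transcript = lalm("Step 1: Transcribe the speech in this clip.", audio)
    scene = lalm(f"Step 2: Given the transcript '{transcript}', "
                 "identify the acoustic scene.", audio)
    events = lalm(f"Step 3: Given the transcript and the scene '{scene}', "
                  "list the non-speech sound events.", audio)
    return {"speech": transcript, "scene": scene, "events": events}
```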
The experiments are well-structured, evaluating multiple Large Audio Language Models (LALMs) across various tasks (ASR, ASC, and AT) under different signal-to-noise ratios (SNRs). The results demonstrate the performance of LALMs in independent and joint understanding settings, providing a comprehensive view of their capabilities. However, the paper could benefit from more extensive comparisons with state-of-the-art models beyond CLAP-based approaches to contextualize the findings further.
The authors have committed to releasing all data and code, which is a positive step towards reproducibility. However, the paper lacks detailed descriptions of the experimental setups, including hyperparameters and specific configurations used for the LALMs, which could hinder full reproducibility.
One limitation is the reliance on a limited number of LALMs for evaluation, which may not fully represent the landscape of audio understanding models. Additionally, the performance degradation observed in joint understanding tasks raises questions about the robustness of LALMs in complex scenarios. The paper also does not address potential biases in the datasets used for training and evaluation.
The proposed benchmark and methodologies have significant implications for real-world applications, such as human-machine interaction, automatic transcription services, and environmental sound recognition. By improving the understanding of audio signals in a joint context, this work paves the way for more sophisticated audio processing systems that can better mimic human auditory perception. The paper presents SSEU-Bench, a novel benchmark for evaluating Large Audio Language Models (LALMs) in joint audio understanding tasks, significantly advancing the field of audio processing by addressing key gaps in existing benchmarks. The innovative methodology and comprehensive evaluation highlight the potential of LALMs while revealing challenges that remain in achieving robust audio understanding.
Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even when modalities are severely missing at test time. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfections, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, Wuhan University
The main contribution of this paper is the introduction of a robust multimodal target speaker extraction system that integrates various speaker identity cues and demonstrates the importance of training with modality dropout to enhance real-world applicability. The comprehensive analysis of the interactions between modalities and their robustness under dropout conditions provides valuable insights for future research in audio-visual speech enhancement.
The methodology presented in this paper is robust and well-structured, focusing on the integration of multiple modalities (lip, voice, face, and expression embeddings) for target speaker extraction. The authors employ a state-of-the-art audio-visual speech enhancement system and introduce a novel dynamic expression embedding that adds significant value to the model. The systematic evaluation of different combinations of modalities under varying dropout conditions is a strong point, demonstrating a thorough understanding of the challenges in real-world applications. The use of cross-attention mechanisms for voice embeddings is particularly noteworthy, as it enhances the contextual relationship between the enrollment and mixed speech.
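The high-dropout training regime can be summarized with a small sketch that randomly zeroes whole modality streams during training; the per-modality independence and zero-filling strategy are assumptions, with the 0.8 rate mirroring the paper's reported setting.

```python
import torch

def apply_modality_dropout(modalities, p=0.8, training=True):
    """Randomly drop whole modality streams during training.

    `modalities` maps names (e.g. 'lip', 'voice', 'face', 'expression') to tensors;
    a dropped modality is replaced by zeros so the fusion network sees a missing cue.
    Independent per-modality dropout and zero-filling are illustrative assumptions.
    """
    if not training:
        return modalities
    out = {}
    for name, feat in modalities.items():
        dropped = torch.rand(()).item() < p
        out[name] = torch.zeros_like(feat) if dropped else feat
    return out
```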
The experiments are comprehensive, utilizing a well-defined dataset from the 3rd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-3). The paper presents clear results that illustrate the performance of different multimodal configurations under both ideal and challenging conditions. The findings indicate that while a full multimodal ensemble performs best under zero dropout conditions, the robustness of the model significantly improves when trained with high dropout rates. This highlights the practical implications of the research, as it addresses real-world scenarios where modality dropout is common.
The implementation details are adequately described, including the architecture of the baseline system, the training process, and the evaluation metrics used. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the model and datasets would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a specific dataset, which may not fully represent the diversity of real-world audio-visual scenarios. Additionally, while the paper discusses the robustness of the model under dropout conditions, it does not explore the impact of other potential noise factors or environmental variables that could affect performance. The sensitivity of the expression embedding to modality availability could also be a concern in practical applications.
The findings of this research have significant implications for the development of robust audio-visual systems in various applications, including telecommunications, assistive technologies, and interactive systems. By emphasizing the importance of training strategies that account for real-world imperfections, this work contributes to the advancement of more reliable multimodal systems that can operate effectively in unpredictable environments. The main contribution of this paper is the introduction of a robust multimodal target speaker extraction system that integrates various speaker identity cues and demonstrates the importance of training with modality dropout to enhance real-world applicability. The comprehensive analysis of the interactions between modalities and their robustness under dropout conditions provides valuable insights for future research in audio-visual speech enhancement.
Voice activity detection (VAD) is essential in speech-based systems, but traditional methods detect only speech presence without identifying speakers. Target-speaker VAD (TS-VAD) extends this by detecting the speech of a known speaker using a short enrollment utterance, but this assumption fails in open-domain scenarios such as meetings or customer service calls, where the main speaker is unknown. We propose EEND-SAA, an enrollment-less, streaming-compatible framework for main-speaker VAD, which identifies the primary speaker without prior knowledge. Unlike TS-VAD, our method determines the main speaker as the one who talks more steadily and clearly, based on speech continuity and volume. We build our model on EEND using two self-attention attractors in a Transformer and apply causal masking for real-time use. Experiments on multi-speaker LibriSpeech mixtures show that EEND-SAA reduces main-speaker DER from 6.63% to 3.61% and improves F1 from 0.9667 to 0.9818 over the SA-EEND baseline, achieving state-of-the-art performance under conditions involving speaker overlap and noise.
Primary: National Yang Ming Chiao Tung University
All Institutions: Institute of Electrical and Computer Engineering, National Yang Ming Chiao Tung University
The paper presents a novel framework for enrollment-less main speaker voice activity detection using self-attention attractors, significantly advancing the state of the art in multi-speaker scenarios. The technical contributions, particularly the dual self-attention mechanism and real-time processing capabilities, position this work as a meaningful addition to the field of audio processing and speech technology.
The methodology presented in EEND-SAA is innovative in its approach to voice activity detection by eliminating the need for speaker enrollment, which is a significant limitation in traditional systems. The use of self-attention attractors within a Transformer framework is a novel contribution that enhances the model's ability to distinguish the main speaker from background noise and overlapping speech. The dual self-attention mechanism is particularly noteworthy, as it allows the model to focus on both the main speaker and background speakers simultaneously, improving detection accuracy. The incorporation of causal masking for real-time processing further enhances the practicality of the model in interactive environments. Overall, the methodology is sound and builds effectively on existing work in the field, particularly EEND and its variants.
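The causal masking used for streaming compatibility is standard and can be sketched as follows; the embedding size and attention configuration are placeholders, not EEND-SAA's actual hyperparameters.

```python
import torch

def causal_mask(num_frames):
    """Upper-triangular mask so frame t attends only to frames <= t (streaming use)."""
    return torch.triu(torch.ones(num_frames, num_frames, dtype=torch.bool), diagonal=1)

# Example: plug into PyTorch attention (True entries are blocked from attending).
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
x = torch.randn(1, 100, 256)                      # (batch, frames, features)
y, _ = attn(x, x, x, attn_mask=causal_mask(100))  # causal self-attention over frames
```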
The experimental evaluation is robust, utilizing the LibriSpeech dataset to simulate real-world conditions with overlapping speakers and noise. The authors provide comprehensive results that demonstrate significant improvements in main-speaker detection metrics, such as DER and F1 scores, compared to baseline models. The ablation studies are particularly useful in highlighting the contributions of various components, such as positional encoding and the dual attractor design. However, the paper could benefit from more extensive comparisons with additional state-of-the-art methods to further validate the performance claims.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which aids in reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for other researchers to easily replicate the results. Including such resources would significantly enhance the reproducibility of the findings.
One limitation of the proposed approach is its reliance on the quality of the input audio. In extremely noisy environments or with significant overlap from multiple speakers, the model's performance may degrade. Additionally, while the model shows promise in real-time applications, the computational efficiency and latency in practical deployments have not been extensively discussed. The paper also does not address potential biases in the training data that could affect the model's generalizability across different speaker demographics.
The implications of this research are significant for various applications in speech recognition, customer service, and interactive voice systems, where identifying the main speaker in noisy environments is crucial. The enrollment-less approach can facilitate more flexible and user-friendly systems, making them more accessible in real-world scenarios. This work could lead to advancements in smart assistants, meeting transcription tools, and other audio processing applications, ultimately improving user experience and system efficiency. The paper presents a novel framework for enrollment-less main speaker voice activity detection using self-attention attractors, significantly advancing the state of the art in multi-speaker scenarios. The technical contributions, particularly the dual self-attention mechanism and real-time processing capabilities, position this work as a meaningful addition to the field of audio processing and speech technology.
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single-channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthews correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
Primary: inst2 Kayapanda Mandana
All Institutions: inst1 Yue Rong, inst1 Milan Marocchi, inst2 Kayapanda Mandana, inst1 Matthew Fynn
This paper effectively combines advanced machine learning techniques with practical applications in cardiovascular health, showcasing a novel approach to heart sound classification that could significantly impact clinical practices. The integration of synthetic data generation methods addresses critical data limitations, making it a valuable contribution to the field of medical machine learning.
The paper presents a comprehensive methodology that integrates traditional signal processing with advanced deep learning techniques, specifically leveraging Wav2Vec 2.0 and diffusion models (WaveGrad and DiffWave) for synthetic and augmented biosignal generation. The approach is innovative in that it addresses the scarcity of high-quality, synchronized datasets for heart sound classification by generating synthetic data, which is a significant contribution to the field. The methodology is well-structured, with clear steps for data augmentation, model training, and evaluation, although some details regarding hyperparameter tuning and model architecture could be elaborated further for clarity.
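A minimal sketch of fine-tuning a Wav2Vec 2.0 classifier on heart-sound segments with Hugging Face Transformers is shown below, assuming a generic base checkpoint and a binary normal/abnormal label set; the authors' actual checkpoint, preprocessing, and training schedule may differ.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Placeholder checkpoint and label set, not the paper's exact configuration.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2  # normal vs. abnormal heart sound
)

waveform = torch.randn(16000 * 5)  # stand-in for a 5 s PCG segment at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in a real training loop
```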
The experimental section is robust, utilizing multiple datasets (CinC 2016 and a wearable vest dataset) to validate the proposed models. The results demonstrate state-of-the-art performance on the CinC dataset and near-state-of-the-art results on the multichannel vest dataset. The metrics used for evaluation (accuracy, UAR, sensitivity, specificity, and MCC) are appropriate for the classification task, and the use of cross-validation enhances the reliability of the findings. However, the paper could benefit from a more detailed comparison with existing methods beyond just accuracy metrics to provide a clearer context of its contributions.
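For reference, the Matthews correlation coefficient reported above follows the standard binary-classification definition, computed from the confusion matrix:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```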
The paper provides a reasonable level of detail regarding the implementation, including the hardware used and the data preprocessing steps. However, it lacks a direct link to a code repository or demo, which would greatly enhance reproducibility. Additionally, while the hyperparameter optimization process is mentioned, specific values and configurations used in the experiments are not fully disclosed, which could hinder replication efforts by other researchers.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world heart sounds. The performance on the multichannel vest dataset is lower than on the CinC dataset, indicating challenges in generalizing the model to noisier, real-world data. Furthermore, the paper does not address potential biases in the datasets used, which could affect the model's applicability across diverse populations.
The implications of this research are significant, particularly in the context of early detection of cardiovascular diseases, which is a leading cause of mortality worldwide. The ability to classify heart sounds accurately and inexpensively could enhance pre-screening methods and facilitate better patient outcomes. The integration of multimodal data (PCG and ECG) also opens avenues for more comprehensive diagnostic tools in cardiology. This paper effectively combines advanced machine learning techniques with practical applications in cardiovascular health, showcasing a novel approach to heart sound classification that could significantly impact clinical practices. The integration of synthetic data generation methods addresses critical data limitations, making it a valuable contribution to the field of medical machine learning.
Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end, approximately 54× less compute (25× architecture-only) than spectrogram-domain separators like AudioSep, while remaining fully bitstream-compatible.
The main contribution of this paper is the introduction of CodecSep, a novel NAC-based model for universal sound separation that combines efficient audio processing with text-driven control, outperforming existing methods in both fidelity and computational efficiency. This work represents a significant advancement in the field of audio processing, particularly for applications requiring real-time performance on resource-constrained devices.
The paper introduces CodecSep, a novel approach to universal sound separation that leverages neural audio codecs (NACs) and a transformer-based masker modulated by text embeddings. This combination allows for efficient, on-device sound separation while maintaining high fidelity and perceptual quality. The use of Feature-wise Linear Modulation (FiLM) to condition the transformer on text embeddings is an innovative aspect that enhances the model's ability to interpret prompts semantically. The methodology is well-structured, with a clear rationale for the design choices, particularly the decision to operate in the latent space of the codec rather than the spectrogram domain. This choice significantly reduces computational requirements and improves separation performance.
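The FiLM conditioning described above can be sketched in a few lines: a text embedding (e.g., from CLAP) is projected to per-channel scale and shift parameters applied to the codec-latent frames. The dimensions and single-layer projections are illustrative assumptions, not CodecSep's actual masker internals.

```python
import torch
import torch.nn as nn

class FiLMFromText(nn.Module):
    """Feature-wise linear modulation of codec-latent frames by a text embedding."""
    def __init__(self, text_dim=512, latent_dim=1024):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, latent_dim)
        self.to_beta = nn.Linear(text_dim, latent_dim)

    def forward(self, latents, text_emb):
        # latents: (B, T, latent_dim); text_emb: (B, text_dim), e.g. a CLAP text embedding
        gamma = self.to_gamma(text_emb).unsqueeze(1)  # (B, 1, latent_dim)
        beta = self.to_beta(text_emb).unsqueeze(1)
        return gamma * latents + beta
```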
The authors conduct a comprehensive evaluation across multiple datasets, including both in-domain and open-domain benchmarks. The results demonstrate that CodecSep outperforms existing models like AudioSep in separation fidelity (measured by SI-SDR) while remaining competitive in perceptual quality (ViSQOL). The experiments are well-designed, with careful attention to matched training and prompt protocols, and the inclusion of ablation studies helps to isolate the contributions of different components of the model. The reported performance gains are substantial, indicating the effectiveness of the proposed approach.
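Since SI-SDR is the headline fidelity metric, a reference implementation of its standard definition is included below for clarity; it is not taken from the paper's codebase.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (standard definition; signals are zero-meaned first)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    scale = (estimate * reference).sum(-1, keepdim=True) / (reference.pow(2).sum(-1, keepdim=True) + eps)
    target = scale * reference          # projection of the estimate onto the reference
    noise = estimate - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
```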
The paper mentions that supplementary code will be provided to facilitate reproducibility, which is a positive aspect. However, specific implementation details, hyperparameters, and training configurations are deferred to the appendix, which may pose challenges for full reproducibility unless the supplementary materials are made readily accessible.
The paper acknowledges several limitations, including the modest scale of training data and prompt diversity, which could affect generalization. Additionally, it notes that while the model is robust to synonymic paraphrases, it has not been tested with prompts that include explicit temporal structures. The perceptual quality of sound effects (SFX) in some cases trails behind the best competing scores, indicating room for improvement.
The ability to perform efficient, text-guided sound separation has significant implications for various applications, including media production, assistive technologies, and real-time audio editing. The model's efficiency makes it suitable for deployment on edge devices, which could democratize access to advanced audio processing capabilities. The main contribution of this paper is the introduction of CodecSep, a novel NAC-based model for universal sound separation that combines efficient audio processing with text-driven control, outperforming existing methods in both fidelity and computational efficiency. This work represents a significant advancement in the field of audio processing, particularly for applications requiring real-time performance on resource-constrained devices.
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. These results highlight the effectiveness of semantic and contextual guidance for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
Primary: Work does not relate to position at Amazon
All Institutions: Work does not relate to position at Amazon
FuseCodec introduces a novel framework for speech tokenization that effectively integrates acoustic, semantic, and contextual signals, significantly advancing the state of the art in speech processing. The combination of innovative methodologies and strong empirical results positions this work as a meaningful contribution to the field of machine learning and audio processing.
The methodology proposed in FuseCodec is innovative, particularly in its approach to integrating semantic and contextual features into the encoder latent space. The three techniques—Latent Representation Fusion, Global Semantic-Contextual Supervision, and Temporally Aligned Contextual Supervision—are well-defined and address significant challenges in speech tokenization. The use of strong cross-modal alignment and globally informed supervision is a notable advancement, enhancing the robustness of the model. However, the paper could benefit from a more detailed explanation of the implementation specifics and how these techniques interact in practice.
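The latent fusion and globally pooled supervision can be sketched as follows; the additive fusion, mean pooling, cosine objective, and matching feature dimensions are simplifying assumptions rather than FuseCodec's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentFusion(nn.Module):
    """Fuse semantic and contextual features into the codec encoder latent (illustrative)."""
    def __init__(self, latent_dim, semantic_dim, context_dim):
        super().__init__()
        self.sem_proj = nn.Linear(semantic_dim, latent_dim)
        self.ctx_proj = nn.Linear(context_dim, latent_dim)

    def forward(self, latent, semantic, contextual):
        # latent: (B, T, latent_dim); semantic/contextual: (B, T, *_dim), time-aligned
        return latent + self.sem_proj(semantic) + self.ctx_proj(contextual)

def global_supervision_loss(token_embeddings, semantic, contextual):
    """Pull token embeddings toward globally pooled semantic/contextual targets.

    Assumes all three tensors share the same feature dimension for the cosine terms.
    """
    sem_target = semantic.mean(dim=1, keepdim=True)   # (B, 1, D) pooled over time
    ctx_target = contextual.mean(dim=1, keepdim=True)
    return (1 - F.cosine_similarity(token_embeddings, sem_target, dim=-1)).mean() + \
           (1 - F.cosine_similarity(token_embeddings, ctx_target, dim=-1)).mean()
```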
The experiments conducted on the LibriSpeech dataset demonstrate the effectiveness of FuseCodec, achieving state-of-the-art results in multiple metrics such as transcription accuracy and perceptual quality. The comparison with existing models like EnCodec and SpeechTokenizer is thorough, showcasing clear improvements. However, the paper lacks a comprehensive analysis of the statistical significance of the results, which would strengthen the claims made about performance improvements.
The availability of code and pretrained models on GitHub is a positive aspect, promoting reproducibility. However, the paper should include more detailed instructions on the setup and any dependencies required to run the experiments, as well as the specific configurations used for training and evaluation.
One limitation is the potential overfitting to the LibriSpeech dataset, as the paper does not discuss generalization to other datasets or real-world applications. Additionally, while the methods are promising, the complexity of the model may pose challenges in deployment scenarios where computational resources are limited.
The implications of FuseCodec extend beyond speech tokenization, potentially impacting areas such as speech synthesis and natural language processing. The integration of semantic and contextual cues could enhance various applications, including virtual assistants, transcription services, and accessibility tools for the hearing impaired. The work encourages further exploration of multimodal approaches in audio processing, which could lead to more intuitive human-computer interactions. FuseCodec introduces a novel framework for speech tokenization that effectively integrates acoustic, semantic, and contextual signals, significantly advancing the state of the art in speech processing. The combination of innovative methodologies and strong empirical results positions this work as a meaningful contribution to the field of machine learning and audio processing.
Agentic AI has been standardized in industry as a practical paradigm for coordinating specialized models and tools to solve complex multimodal tasks. In this work, we present WeaveMuse, a multi-agent system for music understanding, symbolic composition, and audio synthesis. Each specialist agent interprets user requests, derives machine-actionable requirements (modalities, formats, constraints), and validates its own outputs, while a manager agent selects and sequences tools, mediates user interaction, and maintains state across turns. The system is extendable and deployable either locally, using quantization and inference strategies to fit diverse hardware budgets, or via the HFApi to preserve free community access to open models. Beyond out-of-the-box use, the system emphasizes controllability and adaptation through constraint schemas, structured decoding, policy-based inference, and parameter-efficient adapters or distilled variants that tailor models to MIR tasks. A central design goal is to facilitate intermodal interaction across text, symbolic notation and visualization, and audio, enabling analysis-synthesis-render loops and addressing cross-format constraints. The framework aims to democratize, implement, and make accessible MIR tools by supporting interchangeable open-source models of various sizes, flexible memory management, and reproducible deployment paths.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
WeaveMuse presents a novel multi-agent system for music understanding and generation, emphasizing controllability and accessibility in MIR tasks. The paper's technical contributions are significant, but further empirical validation and methodological details are needed to fully realize its potential impact in the field.
The methodology presented in WeaveMuse is innovative in its approach to creating a multi-agent system for music understanding and generation. The architecture effectively combines various specialized agents that handle different aspects of music processing, such as symbolic composition and audio synthesis. The use of a manager agent to orchestrate these interactions is a significant contribution, allowing for a seamless user experience across modalities. The emphasis on controllability through structured decoding and parameter-efficient adapters is commendable, as it addresses a critical need for flexibility in music information retrieval (MIR) tasks. However, the paper could benefit from a more detailed explanation of the specific algorithms employed within the agents and how they interact with one another.
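The manager/specialist split can be illustrated with a toy orchestration loop; the agent names, keyword routing rule, and state handling below are assumptions for illustration, whereas WeaveMuse delegates planning and tool selection to an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class ManagerAgent:
    """Toy manager that routes requests to specialist agents and carries state across turns."""
    specialists: dict                      # name -> callable(request, state) -> (output, state)
    state: dict = field(default_factory=dict)

    def route(self, request: str) -> str:
        # Naive keyword routing stands in for the framework's LLM-based planner.
        if "score" in request or "notate" in request:
            return "symbolic_composer"
        if "render" in request or "audio" in request:
            return "audio_synthesizer"
        return "music_analyst"

    def handle(self, request: str):
        name = self.route(request)
        output, self.state = self.specialists[name](request, self.state)
        return output
```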
The paper provides a conceptual framework and initial behaviors under constrained settings, but lacks extensive experimental validation. While it outlines the deployment modes and resource management strategies, concrete results demonstrating the performance of the system in real-world scenarios are limited. The authors mention using various models and tools, but without empirical data or comparative analysis, it is challenging to assess the effectiveness of WeaveMuse against existing systems. Future work should include rigorous experiments with quantitative metrics to substantiate the claims made.
The authors have made efforts to ensure reproducibility by providing an open-source framework and a public repository. The identical planner and prompt templates for local and hosted configurations are a positive aspect for users looking to replicate the results. However, the paper could improve by including more detailed instructions on setting up the environment and running experiments, as well as providing sample datasets for testing.
The paper acknowledges several limitations, such as the dependency on the underlying large language model's capabilities and potential performance degradation with smaller models. Additionally, the orchestration of tools and agentic prompting may not always yield the expected results, which could hinder user experience. The authors also note that the system is still a work in progress, indicating that further refinements are necessary.
WeaveMuse has the potential to significantly impact the field of music information retrieval and generation by democratizing access to advanced music processing tools. Its open-source nature and support for various models could foster community engagement and innovation. The framework's ability to facilitate intermodal interaction across text, symbolic notation, and audio could lead to new applications in music education, composition, and analysis. WeaveMuse presents a novel multi-agent system for music understanding and generation, emphasizing controllability and accessibility in MIR tasks. The paper's technical contributions are significant, but further empirical validation and methodological details are needed to fully realize its potential impact in the field.