Advanced deep learning architectures, particularly recurrent neural networks (RNNs), have been widely applied in audio, bioacoustic, and biomedical signal analysis, especially in data-scarce environments. Gated RNNs remain effective but can be relatively over-parameterised and less training-efficient in some regimes, whereas linear RNNs tend to fall short in capturing the complexity inherent in bio-signals. To address these challenges, we propose the Parallel Delayed Memory Unit (PDMU), a delay-gated state-space module for short-term temporal credit assignment targeting audio and bioacoustic signals, which enhances short-term temporal state interactions and memory efficiency via a gated delay-line mechanism. Unlike previous Delayed Memory Units (DMU) that embed temporal dynamics into the delay-line architecture, the PDMU further compresses temporal information into vector representations using Legendre Memory Units (LMU). This design serves as a form of causal attention, allowing the model to dynamically adjust its reliance on past states and improve real-time learning performance. Notably, in low-information scenarios, the gating mechanism behaves similarly to skip connections by bypassing state decay and preserving early representations, thereby facilitating long-term memory retention. The PDMU is modular, supporting parallel training and sequential inference, and can be easily integrated into existing linear RNN frameworks. Furthermore, we introduce bidirectional, efficient, and spiking variants of the architecture, each offering additional gains in performance or energy efficiency. Experimental results on diverse audio and biomedical benchmarks demonstrate that the PDMU significantly enhances both memory capacity and overall model performance.
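As a rough illustration of the delay-gating idea (not the authors' exact PDMU, whose LMU-based state compression and parallel-scan training are omitted), a minimal sketch of a gated, decaying recurrent state update might look as follows; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn


class GatedDelayLine(nn.Module):
    """Minimal sketch of a delay-gated recurrent update (illustrative, not the PDMU).

    A learned gate interpolates between a decayed/updated state and the previous
    state; when the gate stays near zero, the state is carried forward almost
    unchanged, which acts like a skip connection over the decay."""

    def __init__(self, input_dim: int, state_dim: int):
        super().__init__()
        self.in_proj = nn.Linear(input_dim, state_dim)
        self.gate = nn.Linear(input_dim + state_dim, state_dim)
        # Per-channel decay in (0, 1), a stand-in for an LMU-style state transition.
        self.decay_param = nn.Parameter(torch.zeros(state_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) -> states: (batch, time, state_dim)
        batch, time, _ = x.shape
        state = x.new_zeros(batch, self.decay_param.numel())
        decay = torch.sigmoid(self.decay_param)
        outputs = []
        for t in range(time):
            g = torch.sigmoid(self.gate(torch.cat([x[:, t], state], dim=-1)))
            candidate = decay * state + self.in_proj(x[:, t])
            # The gate chooses between updating the state and preserving it.
            state = g * candidate + (1.0 - g) * state
            outputs.append(state)
        return torch.stack(outputs, dim=1)


if __name__ == "__main__":
    layer = GatedDelayLine(input_dim=16, state_dim=32)
    y = layer(torch.randn(2, 50, 16))
    print(y.shape)  # torch.Size([2, 50, 32])
```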
Primary: Ghent University
All Institutions: Institute for Infocomm Research (IR), Agency for Science, Technology and Research (A*STAR), Ghent University, Department of Electrical and Electronic Engineering
The main contribution of this paper is the introduction of the Parallel Delayed Memory Unit (PDMU), which enhances temporal modeling in audio and biomedical signal analysis through a novel delay-gated architecture. This work represents a significant advancement in the efficiency and effectiveness of RNNs for processing complex temporal data, with potential applications in real-time healthcare solutions and audio processing technologies.
The proposed Parallel Delayed Memory Unit (PDMU) introduces a novel architecture that effectively combines delay-gated mechanisms with Legendre Memory Units to enhance temporal modeling in audio and biomedical signal processing. The methodology is well-structured, leveraging existing frameworks while innovatively addressing the limitations of traditional RNNs and linear models. The introduction of various PDMU variants (bi-directional, efficient, and spiking) demonstrates a comprehensive approach to optimizing performance and energy efficiency, which is particularly relevant for real-time applications.
The experimental evaluation is robust, utilizing a diverse set of benchmarks across audio and biomedical domains. The results demonstrate significant performance improvements over existing models, particularly in low-information scenarios, which is a critical aspect of real-world applications. The ablation studies further validate the contributions of individual components of the PDMU, providing clear evidence of its effectiveness.
The paper includes sufficient implementation details, such as the use of the PyTorch library and specific training configurations, which enhance reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
While the PDMU shows promise, the paper does not extensively discuss potential limitations, such as the scalability of the model to larger datasets or its performance in highly variable real-world conditions. Additionally, the reliance on specific datasets may limit generalizability.
The PDMU has significant implications for fields requiring efficient processing of temporal data, particularly in healthcare and audio signal analysis. Its ability to enhance real-time learning and memory retention could lead to advancements in medical diagnostics and monitoring technologies.
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling objective, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
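For concreteness, a representation-level contrastive objective of the kind described (a symmetric, CLIP-style InfoNCE between paired audio and report embeddings) can be sketched as below; this is a generic formulation, and AcuLa's exact loss, temperature, and projection heads may differ.

```python
import torch
import torch.nn.functional as F


def audio_text_contrastive_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between paired audio and clinical-report embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    a = torch.randn(8, 256)   # audio encoder outputs
    t = torch.randn(8, 256)   # medical-LM report embeddings (the "semantic teacher")
    print(audio_text_contrastive_loss(a, t).item())
```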
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, Kyutai, Eindhoven Artificial Intelligence Systems Institute
The paper introduces AcuLa, a framework that enhances audio encoders with semantic understanding through alignment with a medical language model, demonstrating significant improvements in diagnostic performance across multiple tasks.
The paper presents a novel framework, AcuLa, that effectively aligns audio encoders with a medical language model to enhance the semantic understanding of auscultation sounds. The methodology is robust, utilizing a dual-objective optimization strategy that combines representation alignment with self-supervised modeling to ensure that the audio encoder retains its acoustic modeling capabilities while gaining semantic insights. The use of synthetic data generation from structured metadata to create a large-scale dataset for training is particularly innovative and addresses the common issue of data scarcity in medical audio analysis.
The experiments are comprehensive, evaluating AcuLa across 18 diverse tasks related to cardio-respiratory health, with results showing significant improvements in performance metrics such as AUROC and MAE. The authors provide a thorough comparison against multiple baseline models, demonstrating that AcuLa consistently outperforms these models, indicating strong empirical support for their claims. The use of a standardized linear probing methodology for evaluation adds rigor to the experimental design.
The paper provides thorough implementation details, including model architectures, training protocols, and data preprocessing steps, which enhances reproducibility. However, the lack of a publicly accessible demo or project URL limits the ability of others to easily replicate the results.
One limitation is the reliance on synthetic data generation, which, while innovative, may introduce biases or inaccuracies if the language model does not perfectly capture the clinical nuances of the audio data. Additionally, the framework's performance on tasks outside the cardio-respiratory domain remains untested, which may limit its generalizability.
The work has significant implications for the field of medical audio analysis, potentially transforming how audio data is utilized in clinical diagnostics. By bridging the gap between acoustic features and clinical semantics, AcuLa could lead to more accurate and clinically relevant diagnostic tools, ultimately improving patient outcomes in respiratory and cardiac health monitoring.
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody, without relying on phoneme-level alignment. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module that derives melody representations directly from reference audio. To ensure robust melody encoding, we employ a teacher model to guide the optimization of the melody extractor, alongside an implicit alignment mechanism that enforces similarity distribution constraints for improved melodic stability and coherence. Additionally, we refine duration modeling using weakly annotated song data and introduce a Flow-GRPO reinforcement learning strategy with a multi-objective reward function to jointly enhance pronunciation clarity and melodic fidelity. Experiments show that our model achieves superior performance over existing approaches in both objective measures and subjective listening tests, especially in zero-shot and lyric adaptation settings, while maintaining high audio quality without manual annotation. This work offers a practical and scalable solution for advancing data-efficient singing voice synthesis. To support reproducibility, we release our inference code and model checkpoints.
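As a toy illustration of a multi-objective reward of the kind used for Flow-GRPO post-training, a pronunciation term (lower WER is better) and a melodic-fidelity term (higher F0 correlation is better) can be combined into one scalar; the weights and exact terms below are assumptions, not the paper's definition.

```python
def multi_objective_reward(wer: float, f0_corr: float,
                           w_pron: float = 0.5, w_melody: float = 0.5) -> float:
    """Hypothetical scalar reward: pronunciation clarity plus melodic fidelity."""
    return w_pron * (1.0 - min(wer, 1.0)) + w_melody * max(f0_corr, 0.0)


print(multi_objective_reward(wer=0.08, f0_corr=0.91))  # -> 0.915
```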
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, University College London
The main contribution of this paper is the introduction of a scalable, annotation-free framework for singing voice synthesis that significantly enhances the accessibility and quality of synthesized singing voices. This work represents a meaningful step forward in the field of audio synthesis, combining innovative methodologies with practical applications that could reshape how music is produced and experienced.
The proposed methodology introduces a novel melody-driven Singing Voice Synthesis (SVS) framework that leverages a Diffusion Transformer architecture combined with a melody extraction module. This innovative approach eliminates the need for phoneme-level alignment and manual annotations, addressing significant scalability issues in existing SVS systems. The integration of a teacher model for melody extraction and the use of a similarity distribution constraint for melodic stability are particularly noteworthy. Additionally, the incorporation of reinforcement learning through Flow-GRPO for post-training optimization is a strong methodological advancement, enhancing both pronunciation clarity and melodic fidelity.
The experimental evaluation is robust, demonstrating the proposed model's superiority over existing approaches in both objective metrics (like Word Error Rate and F0 correlation) and subjective listening tests. The authors provide comprehensive comparisons across multiple experimental setups, including zero-shot synthesis and lyric editing, showcasing significant improvements in performance. The use of diverse datasets and the detailed reporting of results strengthen the credibility of the findings.
The authors have taken steps to ensure reproducibility by releasing their inference code and model checkpoints, which is a positive aspect for the research community. However, the paper could benefit from more detailed descriptions of the training data preparation and specific hyperparameter settings used during experiments to further enhance reproducibility.
While the paper presents a significant advancement, it does not address potential limitations such as the model's performance across different languages or styles beyond the Mandarin dataset used for training. Additionally, the reliance on weakly annotated data may introduce variability in performance that is not fully explored.
The framework has the potential to democratize music creation by enabling users without professional musical training to synthesize high-quality singing voices. This could have wide-ranging applications in the entertainment industry, including music production, virtual performances, and interactive media. The approach also opens avenues for further research in multilingual and cross-cultural singing synthesis.
Large audio language models (LALMs) are increasingly deployed in real-world settings where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences largely overlook. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model's ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures spanning both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. We propose Selective Efficacy (SE), a unified metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LALMs reveals substantial privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we introduce Bystander Privacy Fine-Tuning (BPFT), a training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. BPFT yields substantial gains, improving SE by up to 15.9% over Gemini 2.5 Pro, demonstrating that selective hearing is learnable but far from achieved in current LALMs. SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio foundation models.
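The paper defines Selective Efficacy precisely; purely as a hedged illustration of the idea of a unified metric, the sketch below assumes a harmonic-mean combination of main-speaker comprehension and bystander-refusal rate, so that a model must do well on both axes to score high.

```python
def selective_efficacy(main_accuracy: float, bystander_refusal_rate: float) -> float:
    """Hypothetical SE: harmonic mean of comprehension and privacy protection."""
    if main_accuracy <= 0.0 or bystander_refusal_rate <= 0.0:
        return 0.0
    return (2.0 * main_accuracy * bystander_refusal_rate
            / (main_accuracy + bystander_refusal_rate))


# Strong comprehension but weak bystander protection is pulled down by the weak axis.
print(selective_efficacy(0.85, 0.40))  # -> ~0.54
```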
Primary: Trinity College, Cambridge
All Institutions: Trinity College, Cambridge
The paper presents a pioneering approach to evaluating and improving bystander privacy in large audio language models through the introduction of SH-Bench and BPFT. This work is significant as it not only highlights an overlooked aspect of AI deployment but also provides a systematic framework for addressing privacy concerns, marking a crucial step towards responsible AI development in audio processing.
The paper introduces a novel benchmark (SH-Bench) specifically designed to evaluate the selective hearing capabilities of LALMs, focusing on bystander privacy. The methodology includes a comprehensive dataset of multi-speaker audio mixtures and a unified metric (Selective Efficacy) to assess both comprehension and privacy protection. The introduction of the Bystander Privacy Fine-Tuning (BPFT) training pipeline is a significant methodological advancement that addresses the identified privacy gaps in existing models.
The experiments are well-structured, utilizing a diverse set of models and rigorous evaluation metrics. The results demonstrate the effectiveness of BPFT in improving bystander privacy protection while maintaining comprehension abilities. The evaluation of various models under different operational modes provides a thorough understanding of the current state of LALMs in terms of privacy risks.
The paper provides a clear description of the data collection and evaluation processes, along with links to the GitHub repository for code and implementation details. However, specific hyperparameters and training configurations for the models used in experiments could be better detailed to enhance reproducibility.
The paper acknowledges limitations in focusing solely on single main speaker scenarios and does not explore more complex situations like group discussions. Additionally, while BPFT shows improvements, it may lead to slight degradation in main speaker comprehension, which could be a concern in practical applications.
The work has significant implications for the deployment of LALMs in real-world applications, particularly in contexts where privacy is a concern. By addressing the privacy of bystanders, the research contributes to the ethical deployment of AI technologies in everyday environments, potentially influencing future standards and practices in the field.
Predicting a song's commercial success prior to its release remains an open and critical research challenge for the music industry. Early prediction of music popularity informs strategic decisions, creative planning, and marketing. Existing methods suffer from four limitations: (i) temporal dynamics in audio and lyrics are averaged away; (ii) lyrics are represented as a bag of words, disregarding compositional structure and affective semantics; (iii) artist- and song-level historical performance is ignored; and (iv) multimodal fusion approaches rely on simple feature concatenation, resulting in poorly aligned shared representations. To address these limitations, we introduce GAMENet, an end-to-end multimodal deep learning architecture for music popularity prediction. GAMENet integrates modality-specific experts for audio, lyrics, and social metadata through an adaptive gating mechanism. We use audio features from Music4AllOnion processed via OnionEnsembleAENet, a network of autoencoders designed for robust feature extraction; lyric embeddings derived through a large language model pipeline; and newly introduced Career Trajectory Dynamics (CTD) features that capture multi-year artist career momentum and song-level trajectory statistics. Using the Music4All dataset (113k tracks), previously explored in MIR tasks but not popularity prediction, GAMENet achieves a 12% improvement in R^2 over direct multimodal feature concatenation. Spotify audio descriptors alone yield an R^2 of 0.13. Integrating aggregate CTD features increases this to 0.69, with an additional 7% gain from temporal CTD features. We further validate robustness using the SpotGenTrack Popularity Dataset (100k tracks), achieving a 16% improvement over the previous baseline. Extensive ablations confirm the model's effectiveness and the distinct contribution of each modality.
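A minimal sketch of adaptive gating over modality experts (audio, lyrics, and CTD/metadata features) is given below, assuming a softmax gate and a shared fusion space; this is illustrative only and not the exact GAMENet architecture, and all dimensions and names are hypothetical.

```python
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Modality experts fused by an input-conditioned softmax gate (sketch)."""

    def __init__(self, dims: dict, hidden: int = 128):
        super().__init__()
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for name, d in dims.items()
        })
        self.gate = nn.Linear(sum(dims.values()), len(dims))
        self.head = nn.Linear(hidden, 1)   # popularity regression head

    def forward(self, feats: dict) -> torch.Tensor:
        keys = list(self.experts.keys())
        expert_out = torch.stack([self.experts[k](feats[k]) for k in keys], dim=1)  # (B, M, H)
        # Gate weights depend on the full multimodal input, one weight per expert.
        weights = torch.softmax(self.gate(torch.cat([feats[k] for k in keys], dim=-1)), dim=-1)
        fused = (weights.unsqueeze(-1) * expert_out).sum(dim=1)                      # (B, H)
        return self.head(fused).squeeze(-1)


if __name__ == "__main__":
    model = GatedMultimodalFusion({"audio": 64, "lyrics": 384, "ctd": 20})
    batch = {"audio": torch.randn(4, 64), "lyrics": torch.randn(4, 384), "ctd": torch.randn(4, 20)}
    print(model(batch).shape)  # torch.Size([4])
```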
Primary: AAAI Publications Committee
All Institutions: AAAI Publications Committee
This paper presents GAMENet, a novel multimodal deep learning framework for predicting music popularity that effectively integrates diverse data sources and temporal dynamics. The technical contributions, particularly the introduction of CTD features and the adaptive gating mechanism, represent a meaningful advancement in the field of music information retrieval and machine learning.
The methodology is robust, introducing GAMENet, a multimodal deep learning architecture that effectively combines audio, lyrics, and social metadata through an adaptive gating mechanism. The incorporation of Career Trajectory Dynamics (CTD) features is particularly innovative, addressing the temporal dynamics and historical performance aspects that previous models overlooked. The use of OnionEnsembleAENet for audio feature extraction demonstrates a thoughtful approach to feature engineering, ensuring that the model captures essential characteristics from diverse modalities.
The experiments are extensive and well-structured, utilizing two large datasets (Music4All and SpotGenTrack) to validate the model's performance. The reported improvements in R² scores (from 0.13 to 0.69 with CTD features) are significant, indicating that the proposed methods yield meaningful advancements over existing baselines. The ablation studies effectively demonstrate the contribution of each modality, reinforcing the importance of the CTD features in the overall model performance.
The paper provides detailed descriptions of the datasets, preprocessing steps, and model training procedures, which enhances reproducibility. However, it lacks a direct link to code or a demo, which would further facilitate independent verification of results.
One limitation is the reliance on historical data, which may not fully capture future trends in music popularity, especially in a rapidly evolving industry. Additionally, while the model shows strong performance, it may still struggle with songs that have atypical engagement patterns or those that do not fit established trends.
The findings have significant implications for the music industry, offering a tool for predicting song popularity that can inform marketing strategies and creative decisions. The model's ability to integrate various data types could also inspire further research into multimodal learning applications across different domains.
Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: https://lrwinr.github.io/RRPO-CosyVoice.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, Tongyi Lab
The main contribution of this paper is the introduction of the RRPO framework, which effectively mitigates reward hacking in emotional TTS systems through a novel hybrid regularization scheme. This work represents a significant advancement in the field, addressing critical challenges in generating emotionally expressive and natural speech synthesis while providing a robust methodology that can be adapted to other applications.
The proposed Robust Reward Policy Optimization (RRPO) framework introduces a hybrid regularization scheme that effectively addresses the vulnerabilities of existing differentiable reinforcement learning methods in emotional text-to-speech (TTS). The methodology is well-structured, leveraging three distinct regularization techniques—Label Smoothing, Energy-Adaptive Mixup, and Adversarial Training—to enhance the robustness of the reward model (RM). Each component is justified with clear explanations of how it mitigates specific issues related to reward hacking, making the approach both innovative and practical. However, the complexity of the hybrid scheme may pose challenges in terms of implementation and tuning.
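To make the three regularizers concrete, the sketch below shows one hedged training step for an emotion reward model that combines label smoothing, a plain mixup step (the paper's energy-adaptive variant would modulate the mixing by frame energy), and an FGSM-style adversarial perturbation; this is an illustration under these assumptions, not the authors' exact recipe, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F


def rm_training_step(model, mels, labels, smooth=0.1, mix_alpha=0.2, eps_adv=1e-3):
    """One regularized update for an emotion reward model (illustrative sketch)."""
    # Mixup between random pairs in the batch; label smoothing via cross_entropy.
    lam = torch.distributions.Beta(mix_alpha, mix_alpha).sample().item()
    perm = torch.randperm(mels.size(0))
    mixed = lam * mels + (1.0 - lam) * mels[perm]
    logits = model(mixed)
    loss_mix = (lam * F.cross_entropy(logits, labels, label_smoothing=smooth)
                + (1.0 - lam) * F.cross_entropy(logits, labels[perm], label_smoothing=smooth))

    # FGSM-style adversarial example built from the clean batch.
    mels_adv = mels.clone().detach().requires_grad_(True)
    loss_clean = F.cross_entropy(model(mels_adv), labels, label_smoothing=smooth)
    grad, = torch.autograd.grad(loss_clean, mels_adv)
    adv_input = (mels_adv + eps_adv * grad.sign()).detach()
    loss_adv = F.cross_entropy(model(adv_input), labels, label_smoothing=smooth)
    return loss_mix + loss_adv


if __name__ == "__main__":
    # Toy classifier over 7 emotion classes, operating on flattened mel spectrograms.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(80 * 100, 7))
    mels = torch.randn(4, 80, 100)
    labels = torch.randint(0, 7, (4,))
    print(rm_training_step(model, mels, labels).item())
```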
The experiments are comprehensive, involving both subjective and objective evaluations to demonstrate the effectiveness of RRPO. The use of a high-quality emotional dataset and a well-defined evaluation methodology, including Mean Opinion Score (MOS) assessments, strengthens the validity of the results. The ablation study provides insights into the contributions of each regularization component, although further exploration of the trade-offs between generalization and adversarial robustness could enhance understanding.
The paper provides sufficient implementation details, including the architecture of the RM and the training setup. However, the lack of a publicly available code repository limits the reproducibility of the results. Providing access to the trained models and code would significantly enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a single language (Mandarin) for training and evaluation, which may affect the generalizability of the findings to other languages or dialects. Additionally, while the hybrid regularization scheme shows promise, its complexity may hinder practical application in real-world scenarios. The potential for overfitting to the small, high-quality dataset used for fine-tuning the RM is also a concern.
The proposed RRPO framework has the potential to significantly improve emotional TTS systems, enhancing human-computer interaction through more expressive and natural speech synthesis. The robustness of the RM against reward hacking can lead to more reliable applications in various domains, including virtual assistants, gaming, and mental health support. The adaptability of the hybrid regularization scheme for other acoustic tasks suggests broader implications for the field of audio processing and machine learning.
Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.
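The sketch below illustrates a rectified flow matching loss with a simple per-dimension energy weighting, in the spirit of the "energy-balanced" loss mentioned in the abstract; the weighting scheme is an assumption, and the model interface `model(xt, t, cond)` is hypothetical.

```python
import torch


def energy_weighted_rfm_loss(model, x1, cond, eps=1e-6):
    """Rectified flow matching with an assumed energy-based re-weighting.

    x1: target mel features (B, T, D); cond: whatever conditioning the model takes."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                     # straight-line interpolation
    v_target = x1 - x0                               # rectified-flow velocity target
    v_pred = model(xt, t, cond)

    # Up-weight low-energy feature dimensions (often the high-frequency mel bands)
    # so they are not drowned out by high-energy ones.
    energy = x1.pow(2).mean(dim=(0, 1), keepdim=True)
    weight = 1.0 / (energy + eps)
    weight = weight / weight.mean()
    return (weight * (v_pred - v_target).pow(2)).mean()


if __name__ == "__main__":
    toy_model = lambda xt, t, cond: torch.zeros_like(xt)   # stand-in velocity predictor
    x1 = torch.randn(2, 120, 80)
    print(energy_weighted_rfm_loss(toy_model, x1, cond=None).item())
```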
Primary: East China University of Science and Technology
All Institutions: East China University of Science and Technology, Tsinghua University, University College London (UCL)
The main contribution of this paper is the development of YingMusic-SVC, a robust zero-shot singing voice conversion framework that effectively integrates singing-specific inductive biases and advanced training methodologies to achieve high-quality voice conversion in real-world scenarios. This work represents a significant step forward in the field of audio processing, addressing critical challenges and paving the way for future innovations in singing voice conversion technology.
The proposed YingMusic-SVC framework presents a comprehensive approach to singing voice conversion (SVC) by integrating continuous pre-training, robust supervised fine-tuning, and reinforcement learning. The introduction of singing-specific modules, such as the RVC timbre shifter and F0-aware timbre adaptor, demonstrates a thoughtful adaptation of existing methodologies to better suit the unique characteristics of singing voices. The use of energy-balanced flow matching loss further enhances the model's ability to capture high-frequency details, which is critical for realistic singing voice conversion. The multi-stage training approach is well-structured and effectively addresses the challenges posed by real-world audio scenarios.
The experiments conducted on a graded multi-track benchmark are extensive and provide a solid basis for evaluating the proposed system's performance. The results indicate significant improvements over existing baselines in various metrics, including timbre similarity, intelligibility, and perceptual naturalness, particularly in challenging conditions with background music. The use of both objective and subjective evaluation metrics strengthens the findings, showcasing the model's robustness and effectiveness in real-world applications.
The paper provides sufficient implementation details, including the architecture, training data, and evaluation metrics, which facilitate reproducibility. The open-source availability of the code and models further enhances the potential for other researchers to replicate and build upon this work.
While the proposed method shows promising results, it may still struggle with extreme vocal ranges or highly complex musical arrangements that deviate significantly from the training data. Additionally, the reliance on a well-structured training dataset may limit the model's generalizability to less common singing styles or languages not represented in the training corpus.
The advancements presented in YingMusic-SVC have significant implications for various applications, including music production, virtual singers, and content creation in social media. By improving the quality and robustness of singing voice conversion, this work could enhance user experiences in entertainment and creative industries, potentially leading to new forms of artistic expression and collaboration.
In our recent work, we proposed Lightweight Speech Enhancement Guided Target Speech Extraction (LGTSE) and demonstrated its effectiveness in multi-speaker-plus-noise scenarios. However, real-world applications often involve more diverse and complex conditions, such as one-speaker-plus-noise or two-speaker-without-noise. To address this challenge, we extend LGTSE with a Cross-Condition Consistency learning strategy, termed TripleC Learning. This strategy is first validated under the multi-speaker-plus-noise condition and then evaluated for its generalization across diverse scenarios. Moreover, building upon the lightweight front-end denoiser in LGTSE, which can flexibly process both noisy and clean mixtures and shows strong generalization to unseen conditions, we integrate TripleC learning with a proposed parallel universal training scheme that organizes batches containing multiple scenarios for the same target speaker. By enforcing consistent extraction across different conditions, easier cases can assist harder ones, thereby fully exploiting diverse training data and fostering a robust universal model. Experimental results on the Libri2Mix three-condition tasks demonstrate that the proposed LGTSE with TripleC learning achieves superior performance over condition-specific models, highlighting its strong potential for universal deployment in real-world speech applications.
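A minimal sketch of a cross-condition consistency term follows: extractions of the same target utterance obtained from several mixture conditions in one batch are pulled toward their mean estimate. The actual TripleC objective and its batch-organization scheme may be defined differently.

```python
import torch
import torch.nn.functional as F


def cross_condition_consistency_loss(extractions):
    """extractions: list of (B, T) waveforms of the same target speaker,
    each extracted from a different mixture condition."""
    stacked = torch.stack(extractions, dim=0)              # (C, B, T)
    anchor = stacked.mean(dim=0, keepdim=True).detach()    # consensus estimate
    return F.mse_loss(stacked, anchor.expand_as(stacked))


if __name__ == "__main__":
    # e.g. 2spk+noise, 2spk clean, 1spk+noise estimates of the same target utterance
    ests = [torch.randn(4, 16000) for _ in range(3)]
    print(cross_condition_consistency_loss(ests).item())
```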
Primary: Shanghai Normal University
All Institutions: Shanghai Normal University, Unisound AI Technology Co
The paper presents a significant advancement in target speech extraction by introducing TripleC Learning, which enhances the robustness of speech enhancement models across diverse acoustic conditions. The methodology is innovative, and the results demonstrate its potential for practical applications, although further validation and reproducibility efforts are needed.
The paper introduces a novel approach termed TripleC Learning, which enhances the existing Lightweight Speech Enhancement Guided Target Speech Extraction (LGTSE) model by enforcing cross-condition consistency during training. This methodology is well-structured, leveraging the strengths of easier conditions to improve performance in more challenging scenarios. The integration of a parallel universal training scheme is particularly innovative, as it allows the model to learn from multiple conditions simultaneously, thereby improving generalization across diverse acoustic environments. The use of a lightweight denoiser (GTCRN) is also a strong point, as it maintains efficiency while achieving robust performance.
The experiments are conducted on the Libri2Mix dataset, which is a well-regarded benchmark in the field. The results demonstrate significant improvements over condition-specific models, validating the proposed method's effectiveness. The paper provides comprehensive comparisons with existing approaches, showcasing the superiority of the proposed model across various metrics (SI-SDR, PESQ, STOI). However, the evaluation could benefit from additional datasets or real-world scenarios to further substantiate claims of generalization.
The paper lacks detailed implementation specifics, such as hyperparameter settings and model architecture diagrams, which are crucial for reproducibility. While it mentions the use of Adam optimizer and training strategies, more explicit details would enhance the ability of other researchers to replicate the study. Additionally, the absence of a public code repository limits accessibility for further exploration and validation of the findings.
One notable limitation is the slight performance drop observed in the easier 1-speaker-plus-noise condition when applying TripleC Learning. This suggests a trade-off that may need further investigation. Additionally, the model's performance in real-world applications remains to be tested, as the experiments are confined to a controlled dataset. The reliance on a specific dataset (Libri2Mix) may also limit the generalizability of the findings to other speech enhancement tasks.
The proposed model has significant implications for real-world applications in automatic speech recognition, hearing aids, and communication systems, especially in environments with varying noise conditions. Its lightweight nature makes it suitable for deployment in resource-constrained devices, potentially improving accessibility to advanced speech processing technologies. The focus on universal models that generalize across conditions aligns well with current trends in machine learning, emphasizing the need for adaptable and efficient solutions.
Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text-speech sequences without pseudo-alignment requirements. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-vae codec that provides 3× training acceleration. Experimental results on Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.
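As a simplified sketch of a joint (multi-modal) DiT layer of the kind the abstract describes, text tokens and noisy speech tokens can be concatenated and attended over jointly, which lets variable-length sequences align without an explicit duration model; timestep conditioning, modality-specific projections, and other details are omitted, and the names are hypothetical.

```python
import torch
import torch.nn as nn


class JointDiTBlock(nn.Module):
    """Simplified joint attention block over concatenated text and speech tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text: torch.Tensor, speech: torch.Tensor):
        n_text = text.size(1)
        x = torch.cat([text, speech], dim=1)              # one joint sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # cross-modal self-attention
        x = x + self.mlp(self.norm2(x))
        return x[:, :n_text], x[:, n_text:]               # split back per modality


if __name__ == "__main__":
    block = JointDiTBlock(dim=256)
    text, speech = torch.randn(2, 20, 256), torch.randn(2, 120, 256)
    t_out, s_out = block(text, speech)
    print(t_out.shape, s_out.shape)  # (2, 20, 256) (2, 120, 256)
```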
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of M3-TTS, a novel non-autoregressive text-to-speech synthesis framework that leverages a multi-modal diffusion transformer architecture to achieve efficient and high-fidelity speech synthesis without the need for pseudo-alignment. This work addresses critical challenges in the field and demonstrates promising results, positioning it as a significant advancement in TTS technology.
The proposed M3-TTS framework introduces a novel multi-modal diffusion transformer architecture that effectively aligns text and speech without relying on traditional pseudo-alignment methods. The use of joint diffusion transformer layers for cross-modal alignment is innovative and addresses significant limitations in existing NAR TTS systems. The integration of a mel-vae codec for training acceleration further enhances the efficiency of the model, making it a compelling contribution to the field of speech synthesis.
The authors present experimental results on established benchmarks, Seed-TTS and AISHELL-3, showcasing state-of-the-art performance with the lowest word error rates reported. The evaluation metrics used, including naturalness scores, are appropriate for assessing TTS systems. However, the paper could benefit from a more detailed comparison with other recent methods to contextualize the improvements achieved by M3-TTS.
The paper mentions that code and demos will be available, which is a positive aspect for reproducibility. However, the details regarding the implementation specifics, hyperparameter settings, and training procedures are not extensively covered, which may hinder complete reproducibility by other researchers.
One limitation of the study is the lack of a comprehensive analysis of the model's performance across diverse languages and accents, which could impact its generalizability. Additionally, while the model achieves competitive naturalness scores, the paper does not provide a thorough qualitative analysis of the generated speech, which is crucial for TTS systems.
The advancements in zero-shot high-fidelity speech synthesis have significant implications for various applications, including virtual assistants, audiobooks, and accessibility tools. The ability to generate natural-sounding speech without extensive training on specific datasets could democratize access to high-quality TTS systems, benefiting a wide range of users.
Subjective mean opinion scores (MOS) remain the de facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs. We propose a contract-driven QoE auditing framework: each service graph G is evaluated under a set of human-interpretable experience contracts C, yielding a contract-level satisfaction vector Q(G, C). We show that (i) classical MOS regression is a special case with a degenerate contract set, (ii) contract-driven quality is more stable than MOS under graph view transformations (e.g., pooling by system vs. by system type), and (iii) the effective sample complexity of learning contracts is governed by contract semantics rather than merely the dimensionality of C. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems). On URGENT, we train a contract-aware neural auditor on self-supervised WavLM embeddings; on SingMOS, we perform contract-driven graph auditing using released rating vectors and metadata without decoding audio. Empirically, our auditor matches strong MOS predictors in MOS accuracy while providing calibrated contract probabilities; on SingMOS, Q(G, C) exhibits substantially smaller cross-view drift than raw MOS and graph-only baselines; on URGENT, difficulty curves reveal that mis-specified "simple" contracts can be harder to learn than richer but better aligned contract sets.
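To illustrate the Q(G, C) idea, each contract can be read as a Boolean predicate over a raw rating vector, and the graph-level quality is the per-contract satisfaction rate over the utterances in the graph; the example contracts below are invented for illustration and are not the paper's contract set.

```python
import numpy as np


def contract_satisfaction_vector(ratings, contracts):
    """ratings: (N, R) raw listener rating vectors for utterances in a service graph G;
    contracts: list of Boolean predicates over a single rating vector."""
    ratings = np.asarray(ratings, dtype=float)
    return np.array([np.mean([bool(c(r)) for r in ratings]) for c in contracts])


# Illustrative contracts over 1-5 listener ratings.
contracts = [
    lambda r: r.mean() >= 4.0,   # "most listeners rate it good or better"
    lambda r: r.min() >= 3.0,    # "no listener finds it poor"
    lambda r: r.std() <= 1.0,    # "listeners broadly agree"
]

ratings = [[4, 5, 4], [2, 3, 5], [4, 4, 4]]
print(contract_satisfaction_vector(ratings, contracts))   # three satisfaction rates in [0, 1]
```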
Primary: Mahanakorn University of Technology
All Institutions: Mahanakorn University of Technology
The paper presents a contract-driven QoE auditing framework that enhances traditional MOS evaluation methods by introducing a more interpretable and stable approach to quality assessment. This contribution is significant as it addresses key limitations in existing methodologies, paving the way for more effective quality evaluation in speech and multimedia services.
The paper introduces a novel contract-driven QoE auditing framework that addresses the limitations of traditional MOS by providing a vector of satisfaction rates that is interpretable and stable under various transformations. The methodology is well-structured, utilizing human-interpretable experience contracts and neural network architectures to enhance the quality assessment process. The formalization of experience contracts as Boolean predicates over rating vectors is a significant contribution, as it allows for a richer representation of user experience beyond scalar values.
The experimental setup is robust, utilizing two substantial datasets (URGENT2024_MOS and SingMOS) to validate the proposed framework. The results demonstrate that the contract-driven approach outperforms traditional MOS methods in terms of stability and interpretability, with empirical evidence supporting the claims made in the hypotheses. The evaluation metrics used are appropriate and provide a comprehensive view of the performance of the proposed models.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a project URL or demo also limits the ability for other researchers to replicate the findings.
The paper acknowledges limitations, including the narrow focus on two datasets and the need for broader validation across different contexts. Additionally, the reliance on hand-crafted contracts may limit the generalizability of the approach, and the theoretical aspects of contract satisfaction under graph homomorphisms remain unexplored.
The proposed framework has significant potential implications for the field of quality assessment in multimedia services, particularly in enhancing user experience evaluation methods. By providing a more nuanced understanding of user satisfaction, this work could influence the design of future audio and speech processing systems, making them more aligned with user expectations and service-level objectives.
Transformer-based audio SSL (self-supervised learning) models often treat spectrograms as images, applying convolutional patchification with heavy temporal downsampling. This lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering removes task-relevant high-frequency cues. In this study, we present Aliasing-aware Patch Embedding (AaPE), a drop-in patch stem that mitigates aliasing while preserving high-frequency information. AaPE augments standard patch tokens with features produced by a band-limited complex sinusoidal kernel using a two-sided exponential window that dynamically targets alias-prone bands. Frequency and decay parameters of the kernel are estimated from the input, enabling parallel, adaptive subband analysis whose outputs are fused with the standard patch tokens. AaPE integrates seamlessly into masked teacher-student self-supervised learning. In addition, we combine a multi-mask strategy with a contrastive objective to enforce consistency across diverse mask patterns, stabilizing training. Pre-training on AudioSet is followed by fine-tuning evaluation across diverse downstream benchmarks spanning environmental sounds and other common audio domains. This approach yields state-of-the-art performance on a subset of tasks and competitive results on the remainder. Complementary linear probing evaluation mirrors this pattern, yielding clear gains on several benchmarks and strong performance elsewhere. The collective analysis of these results indicates that AaPE mitigates the effects of aliasing without discarding informative high-frequency content.
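A hedged sketch of the kind of analysis kernel the abstract describes follows: a complex sinusoid under a two-sided exponential window, here with fixed frequency and decay (in AaPE both are predicted from the input). The names and the simple convolution demo are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def complex_exp_kernel(freq_hz: float, decay: float, length: int, sample_rate: float) -> torch.Tensor:
    """Band-limited complex sinusoidal kernel under a two-sided exponential window."""
    t = (torch.arange(length) - (length - 1) / 2) / sample_rate     # centred time axis (s)
    window = torch.exp(-decay * t.abs())                            # two-sided exponential
    return window * torch.exp(1j * 2 * torch.pi * freq_hz * t)      # windowed complex sinusoid


if __name__ == "__main__":
    k = complex_exp_kernel(freq_hz=4000.0, decay=800.0, length=64, sample_rate=16000.0)
    x = torch.randn(1, 1, 1024)                                     # a mono waveform
    # Real and imaginary responses give a subband analytic signal; its magnitude is an
    # alias-aware envelope for that band.
    real = F.conv1d(x, k.real.reshape(1, 1, -1))
    imag = F.conv1d(x, k.imag.reshape(1, 1, -1))
    print(torch.sqrt(real**2 + imag**2).shape)
```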
Primary: Chuo University
All Institutions: Oki Electric Industry Co., Ltd., Chuo University
The paper presents a novel approach to audio representation learning that effectively addresses aliasing issues in transformer-based models. The technical contributions, particularly the adaptive frequency analysis and integration of multi-mask strategies, position this work as a valuable advancement in the field of self-supervised learning for audio.
The methodology presented in this paper introduces the Aliasing-aware Patch Embedding (AaPE) as a novel approach to mitigate aliasing effects in audio representation learning. The use of a Structured Bilateral Laplace Unit (SBLU) that dynamically estimates frequency and decay parameters from input spectrograms is innovative. This allows for adaptive frequency analysis that is crucial for audio tasks where high-frequency information is often lost during aggressive downsampling. The integration of a multi-mask strategy with a contrastive objective further enhances the robustness of the training process. The approach is well-structured and builds on existing self-supervised learning frameworks, making it a significant contribution to the field.
The experimental evaluation is thorough, utilizing a variety of datasets (AudioSet, ESC-50, SCV2, etc.) to assess the performance of AaPE. The results indicate state-of-the-art performance on several benchmarks, demonstrating the effectiveness of the proposed method. The use of both fine-tuning and linear probing provides a comprehensive assessment of the model's capabilities across different tasks. The ablation studies conducted further validate the contributions of each component of the AaPE framework, showcasing the importance of adaptive frequency analysis and the multi-mask strategy.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility, such as hyperparameter settings and code availability. The absence of a project URL or demo limits the ability of other researchers to replicate the findings directly. However, the clear presentation of the methodology allows for a reasonable understanding of how to implement the proposed approach.
The paper acknowledges some limitations, such as reliance on log-mel inputs and potential reduced gains on tasks dominated by fine-grained pitch dynamics or very short contexts. The focus on aliasing-prone bands may not address all aspects of audio representation learning, and the performance on certain tasks may still be suboptimal. Additionally, the increased complexity of the model due to the adaptive components may pose challenges in resource-constrained environments.
The proposed method has significant implications for audio representation learning, particularly in applications such as environmental sound classification, speech recognition, and music analysis. By effectively mitigating aliasing while preserving high-frequency information, AaPE can enhance the performance of various audio processing systems. The approach could lead to improvements in real-world applications where audio quality and accuracy are critical, such as in assistive technologies and multimedia content analysis.
Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with the latter being more advantageous for Automatic Speech Recognition (ASR). Traditionally, unsupervised K-means clustering has been used to extract semantic speech tokens from Speech Foundation Models (SFMs). Recently, supervised methods, such as finite scalar quantization (FSQ) trained with ASR loss, have emerged for speech generation. Both approaches leverage pre-trained SFMs, benefiting low-resource tasks such as child ASR. This paper systematically compares supervised and unsupervised semantic speech tokens for child ASR. Results show that supervised methods not only outperform unsupervised ones but even unexpectedly surpass continuous representations, and they perform well even in ultra-low bitrate settings. These findings highlight the advantages of supervised semantic tokens and offer insights for improving discrete speech tokenization.
Primary: University of California Los Angeles
All Institutions: University of California Los Angeles
This paper presents a comprehensive analysis of supervised versus unsupervised semantic speech tokens for child ASR, revealing that supervised methods can outperform traditional approaches in both performance and efficiency. The innovative methodology and robust experimental evaluation contribute meaningfully to the field of machine learning, particularly in audio processing and speech recognition.
The paper presents a systematic comparison between unsupervised K-means clustering and supervised finite scalar quantization (FSQ) for extracting semantic speech tokens, specifically in the context of child ASR. The methodology is well-structured, leveraging pre-trained Speech Foundation Models (SFMs) to derive both types of tokens and evaluating their performance against continuous representations. The use of ASR loss to optimize the FSQ method is a significant advancement, as it aligns the tokenization process more closely with the downstream ASR task. The experimental design is robust, incorporating multiple datasets and a variety of performance metrics, which adds credibility to the findings.
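For readers unfamiliar with the two tokenisation routes being compared, the sketch below shows them in their standard formulations, not the paper's exact configuration: nearest-centroid assignment for the unsupervised K-means route, and bounded projection plus per-dimension rounding for FSQ, whose straight-through estimator and ASR-loss training are omitted here. Feature dimensions, level counts, and the `proj` matrix are placeholders.

```python
import numpy as np

def kmeans_tokens(frames, centroids):
    """Unsupervised route: nearest-centroid assignment of SFM frame embeddings."""
    d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                              # one token id per frame

def fsq_tokens(frames, proj, levels=(8, 8, 8)):
    """Supervised route (inference view): project to a few bounded dims and round.

    `proj` stands in for a learned projection trained with an ASR loss; the
    straight-through estimator used during training is omitted.
    """
    z = np.tanh(frames @ proj)                           # bound each dimension to (-1, 1)
    ids = np.zeros(len(frames), dtype=np.int64)
    for d, L in enumerate(levels):
        q = np.round((z[:, d] + 1) / 2 * (L - 1)).astype(np.int64)
        ids = ids * L + q                                # mixed-radix token id
    return ids                                           # implicit codebook of prod(levels)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 768))                 # e.g. one SFM layer's outputs
print(kmeans_tokens(frames, rng.standard_normal((512, 768))).shape)
print(fsq_tokens(frames, rng.standard_normal((768, 3))).max() < 8 * 8 * 8)
```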
The experiments are comprehensive, utilizing two distinct child speech corpora (MyST and OGI) to evaluate the performance of both tokenization methods. The results demonstrate that supervised FSQ tokens outperform both unsupervised K-means tokens and continuous representations, which is a noteworthy finding. The paper also explores bitrate efficiency and generalization across different domains and age groups, providing a thorough analysis of the trade-offs involved in tokenization methods. Statistical significance tests further strengthen the validity of the results.
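As a point of reference for the bitrate discussion, the usual accounting is bitrate ≈ token rate × log2(codebook size). The numbers below are purely hypothetical and only illustrate the arithmetic, not figures reported in the paper.

```latex
\text{bitrate} \approx R_{\text{tok}} \cdot \log_2 |\mathcal{C}|,
\qquad \text{e.g.}\ 25~\text{tokens/s} \times \log_2 2048
      = 25 \times 11 = 275~\text{bit/s}.
```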
While the paper provides a detailed description of the experimental setup, including the configurations for both K-means and FSQ tokenizers, it lacks specific implementation details or code availability that would facilitate reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the findings directly.
The paper acknowledges limitations in the generalizability of the supervised FSQ tokenizer across different speaking styles and age groups, suggesting that it may be overfitted to the training data. Additionally, while the findings are compelling, the reliance on specific datasets may not fully capture the diversity of child speech in broader contexts.
The findings have significant implications for the development of efficient ASR systems, particularly in low-resource settings such as child speech recognition. The insights gained from comparing tokenization methods could inform future research and applications in speech processing, potentially leading to improvements in accessibility and usability of ASR technologies for children.
Earables, such as True Wireless Stereo earphones and VR/AR headsets, are increasingly popular, yet their compact design poses challenges for robust voice-related applications like telecommunication and voice assistant interactions in noisy environments. Existing speech enhancement systems, reliant solely on omnidirectional microphones, struggle with ambient noise like competing speakers. To address these issues, we propose VibOmni, a lightweight, end-to-end multi-modal speech enhancement system for earables that leverages bone-conducted vibrations captured by widely available Inertial Measurement Units (IMUs). VibOmni integrates a two-branch encoder-decoder deep neural network to fuse audio and vibration features. To overcome the scarcity of paired audio-vibration datasets, we introduce a novel data augmentation technique that models Bone Conduction Functions (BCFs) from limited recordings, enabling synthetic vibration data generation with only 4.5% spectrogram similarity error. Additionally, a multi-modal SNR estimator facilitates continual learning and adaptive inference, optimizing performance in dynamic, noisy settings without on-device back-propagation. Evaluated on real-world datasets collected from 32 volunteers using different devices, VibOmni achieves up to a 21% improvement in Perceptual Evaluation of Speech Quality (PESQ), a 26% improvement in Signal-to-Noise Ratio (SNR), and roughly a 40% reduction in Word Error Rate (WER), with substantially lower latency on mobile devices. A user study with 35 participants showed that 87% preferred VibOmni over baselines, demonstrating its effectiveness for deployment in diverse acoustic environments.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of VibOmni, a novel multi-modal speech enhancement system that effectively utilizes bone-conducted vibrations to improve speech quality in noisy environments. This work represents a meaningful advancement in the field of audio processing for wearable technology, offering a unique approach to overcoming the limitations of traditional speech enhancement systems.
The paper introduces VibOmni, a multi-modal speech enhancement system that innovatively combines audio and bone-conducted vibration data using a two-branch encoder-decoder deep neural network. The methodology is well-structured, addressing the challenge of limited paired audio-vibration datasets through a novel data augmentation technique that models Bone Conduction Functions (BCFs). This approach is significant as it enables the generation of synthetic vibration data, which is crucial for training the model effectively. The integration of a multi-modal SNR estimator for continual learning and adaptive inference further enhances the robustness of the system in dynamic environments.
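Since the network architecture is not detailed in the review, the following PyTorch sketch only illustrates the general two-branch idea: separate recurrent encoders for the air-conducted spectrogram and the IMU vibration features, concatenated and decoded into an enhancement mask. Layer choices, feature sizes, and module names are assumptions, not VibOmni's actual design.

```python
import torch
import torch.nn as nn

class TwoBranchFusionNet(nn.Module):
    """Minimal two-branch encoder-decoder: not VibOmni's architecture, just an
    illustration of fusing air-conducted audio with IMU vibration features."""
    def __init__(self, n_freq=257, n_vib=64, hidden=128):
        super().__init__()
        self.audio_enc = nn.GRU(n_freq, hidden, batch_first=True)
        self.vib_enc = nn.GRU(n_vib, hidden, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),    # mask over audio spectrogram bins
        )

    def forward(self, audio_spec, vib_spec):
        a, _ = self.audio_enc(audio_spec)               # (B, T, hidden)
        v, _ = self.vib_enc(vib_spec)                   # (B, T, hidden)
        mask = self.decoder(torch.cat([a, v], dim=-1))  # (B, T, n_freq)
        return mask * audio_spec                        # masked (enhanced) spectrogram

net = TwoBranchFusionNet()
out = net(torch.rand(2, 100, 257), torch.rand(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 257])
```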
The experiments conducted on real-world datasets from 32 volunteers provide a solid foundation for evaluating the effectiveness of VibOmni. The reported improvements in PESQ, SNR, and WER demonstrate the system's capability to outperform existing baselines significantly. The user study with 35 participants adds a valuable subjective dimension to the evaluation, indicating a strong preference for VibOmni over traditional methods. However, details on the specific datasets used and the statistical significance of the results could enhance the evaluation's rigor.
The paper lacks detailed information on the implementation specifics, such as the architecture of the neural network, hyperparameter settings, and training procedures, which are essential for reproducibility. Additionally, the absence of a publicly available code repository or demo limits the ability of other researchers to replicate the results.
One notable limitation is the reliance on a relatively small dataset for training and evaluation, which may affect the generalizability of the model across diverse acoustic environments. Furthermore, while the paper addresses the challenge of noise in telecommunication, it does not explore the potential impact of varying user conditions (e.g., different head shapes or ear canal geometries) on the performance of the bone-conduction approach.
The development of VibOmni has significant implications for the usability of earables in noisy environments, potentially enhancing communication for users in various settings, such as crowded public spaces or during physical activities. The integration of bone-conduction technology could also pave the way for future innovations in wearable devices, improving user experience and interaction with voice assistants.
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music data, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; and 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
Primary: Nanjing University
All Institutions: Nanjing University
The Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain. The paper's innovative methodology and strong experimental results position it as a noteworthy contribution to the field of machine learning in audio processing.
The paper introduces the Pianist Transformer, which leverages a unified MIDI data representation to learn musical structure and expression without explicit annotations. This self-supervised approach is innovative, particularly in the context of music performance rendering, where labeled datasets are typically scarce. The architecture is designed to be efficient, allowing for longer context handling and faster inference, which is crucial for real-time applications. The self-supervised pre-training pipeline is robust, utilizing a substantial amount of data (10 billion tokens) and a model with 135 million parameters, showcasing a thoughtful approach to scaling both data and model size.
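The paper's unified MIDI representation is not specified in the review; as a toy illustration of what event-style MIDI tokenisation can look like, the sketch below flattens notes into shift/pitch/velocity/duration tokens. The vocabulary, time resolution, and function name are invented for the example and are not the paper's scheme.

```python
def tokenize_midi_notes(notes, time_step=0.01, max_shift=100):
    """Toy event tokenizer: NOT the paper's representation, just an illustration of
    mapping (onset, pitch, velocity, duration) notes to a flat token sequence.

    notes: list of (onset_sec, pitch, velocity, duration_sec), sorted by onset.
    """
    tokens, prev_onset = [], 0.0
    for onset, pitch, velocity, duration in notes:
        shift = min(int(round((onset - prev_onset) / time_step)), max_shift)
        tokens += [f"SHIFT_{shift}", f"PITCH_{pitch}",
                   f"VEL_{velocity // 8}", f"DUR_{min(int(duration / time_step), max_shift)}"]
        prev_onset = onset
    return tokens

print(tokenize_midi_notes([(0.00, 60, 96, 0.50), (0.25, 64, 80, 0.40)]))
```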
The experiments conducted demonstrate the effectiveness of the proposed model through strong objective metrics and human-level subjective ratings. The authors provide a comprehensive evaluation of the model's performance against existing methods, which is essential for establishing its superiority. However, the paper could benefit from a more detailed breakdown of the experimental setup, including data sources and specific evaluation metrics used.
The paper includes links to the code repository and demo page, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific datasets used are somewhat limited. More transparency in these areas would enhance reproducibility.
One limitation is the reliance on MIDI data, which may not capture the full expressive nuances of piano performance as compared to audio recordings. Additionally, while the model achieves state-of-the-art results, the subjective nature of music performance rendering means that human evaluations can vary widely, which could affect the perceived quality of the outputs.
The implications of this research are significant, as it opens up new avenues for expressive music synthesis and performance rendering. The ability to generate human-like piano performances could have applications in music education, composition, and entertainment. Furthermore, the self-supervised approach could inspire similar methodologies in other domains where labeled data is limited.
Advanced deep learning architectures, particularly recurrent neural networks (RNNs), have been widely applied in audio, bioacoustic, and biomedical signal analysis, especially in data-scarce environments. While gated RNNs remain effective, they can be relatively over-parameterised and less training-efficient in some regimes, while linear RNNs tend to fall short in capturing the complexity inherent in bio-signals. To address these challenges, we propose the Parallel Delayed Memory Unit (PDMU), a {delay-gated state-space module for short-term temporal credit assignment} targeting audio and bioacoustic signals, which enhances short-term temporal state interactions and memory efficiency via a gated delay-line mechanism. Unlike previous Delayed Memory Units (DMU) that embed temporal dynamics into the delay-line architecture, the PDMU further compresses temporal information into vector representations using Legendre Memory Units (LMU). This design serves as a form of causal attention, allowing the model to dynamically adjust its reliance on past states and improve real-time learning performance. Notably, in low-information scenarios, the gating mechanism behaves similarly to skip connections by bypassing state decay and preserving early representations, thereby facilitating long-term memory retention. The PDMU is modular, supporting parallel training and sequential inference, and can be easily integrated into existing linear RNN frameworks. Furthermore, we introduce bidirectional, efficient, and spiking variants of the architecture, each offering additional gains in performance or energy efficiency. Experimental results on diverse audio and biomedical benchmarks demonstrate that the PDMU significantly enhances both memory capacity and overall model performance.
Primary: Ghent University
All Institutions: Institute for Infocomm Research (IR), Agency for Science, Technology and Research (A*STAR), Ghent University, Department of Electrical and Electronic Engineering
The main contribution of this paper is the introduction of the Parallel Delayed Memory Unit (PDMU), which enhances temporal modeling in audio and biomedical signal analysis through a novel delay-gated architecture. This work represents a significant advancement in the efficiency and effectiveness of RNNs for processing complex temporal data, with potential applications in real-time healthcare solutions and audio processing technologies.
The proposed Parallel Delayed Memory Unit (PDMU) introduces a novel architecture that effectively combines delay-gated mechanisms with Legendre Memory Units to enhance temporal modeling in audio and biomedical signal processing. The methodology is well-structured, leveraging existing frameworks while innovatively addressing the limitations of traditional RNNs and linear models. The introduction of various PDMU variants (bi-directional, efficient, and spiking) demonstrates a comprehensive approach to optimizing performance and energy efficiency, which is particularly relevant for real-time applications.
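The PDMU itself is not reproduced here, but a minimal sketch of its two ingredients as described above may help: a Legendre (LMU-style) state that compresses a scalar history into a fixed-size vector, and an input-dependent gate that can bypass the state update, mimicking the skip-connection-like behaviour in low-information stretches. The A/B matrices follow the standard continuous-time LMU formulation (Voelker et al., 2019) with a simple forward-Euler step; everything else (order, window length, the scalar gate) is an assumption for illustration, not the actual PDMU.

```python
import numpy as np

def lmu_matrices(order, theta):
    """Standard continuous-time LMU (Legendre) state-space matrices (Voelker et al., 2019)."""
    i = np.arange(order)[:, None]
    j = np.arange(order)[None, :]
    A = np.where(i < j, -1.0, (-1.0) ** (i - j + 1)) * (2 * i + 1)
    B = ((2 * i + 1) * (-1.0) ** i).reshape(order)
    return A / theta, B / theta

def gated_legendre_memory(u, gate, order=8, theta=64.0):
    """Toy recurrence: a Legendre memory updated through an input-dependent gate.

    Illustrates (i) compressing a scalar history into an `order`-dimensional
    Legendre state and (ii) letting a gate in [0, 1] interpolate between updating
    the state and carrying it over unchanged.
    """
    A, B = lmu_matrices(order, theta)
    m = np.zeros(order)
    states = []
    for u_t, g_t in zip(u, gate):
        m_new = m + A @ m + B * u_t          # forward-Euler step of theta * m' = A m + B u
        m = g_t * m_new + (1.0 - g_t) * m    # gate near 0 bypasses the update entirely
        states.append(m)
    return np.stack(states)

u = np.sin(np.linspace(0, 6 * np.pi, 200))
gate = np.full(200, 0.9)
print(gated_legendre_memory(u, gate).shape)   # (200, 8)
```

The sketch runs sequentially; in the paper's parallel-training setting the same linear recurrence would instead be evaluated with a scan or convolutional formulation.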
The experimental evaluation is robust, utilizing a diverse set of benchmarks across audio and biomedical domains. The results demonstrate significant performance improvements over existing models, particularly in low-information scenarios, which is a critical aspect of real-world applications. The ablation studies further validate the contributions of individual components of the PDMU, providing clear evidence of its effectiveness.
The paper includes sufficient implementation details, such as the use of the PyTorch library and specific training configurations, which enhance reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
While the PDMU shows promise, the paper does not extensively discuss potential limitations, such as the scalability of the model to larger datasets or its performance in highly variable real-world conditions. Additionally, the reliance on specific datasets may limit generalizability.
The PDMU has significant implications for fields requiring efficient processing of temporal data, particularly in healthcare and audio signal analysis. Its ability to enhance real-time learning and memory retention could lead to advancements in medical diagnostics and monitoring technologies.
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques constrain the geometric structure of the latent space, making it harder to capture correlations between features and leading to inefficiencies in representation learning, codebook utilization, and token rate. In this paper we introduce Two Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, achieving low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance on various objective and subjective reconstruction metrics across extensive experiments in the speech domain, compared to state-of-the-art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Q2D2, a geometry-aware quantization scheme that enhances audio codec performance by effectively capturing feature correlations through structured two-dimensional grids. This innovative approach not only improves reconstruction quality but also maintains high codebook utilization, positioning Q2D2 as a promising alternative to traditional quantization methods in audio processing.
The proposed Q2D2 quantization method introduces a novel approach to audio codec design by utilizing two-dimensional geometric structures for quantization. This method addresses limitations in existing quantization techniques, such as RVQ and FSQ, by capturing correlations between features more effectively. The methodology is well-structured, with clear explanations of the geometric tiling strategies and their implications for audio representation. The use of lightweight linear projections and Straight-Through Estimators (STE) enhances the differentiability and stability of the quantization process, making it suitable for end-to-end training.
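The exact grid construction and projection layers are not given in the review, so the sketch below only shows the core geometric step: snapping a projected feature pair to the nearest point of a bounded hexagonal grid and reading a token index off the integer lattice coordinates. The lattice basis, step size, bounds, and brute-force nearest-candidate search are illustrative choices, and the straight-through estimator used for end-to-end training is omitted.

```python
import numpy as np

# Hexagonal lattice basis: not necessarily the paper's exact construction,
# just a standard A2 lattice scaled by `step`.
HEX_BASIS = np.array([[1.0, 0.0],
                      [0.5, np.sqrt(3.0) / 2.0]])

def quantize_hex(pairs, step=0.25, max_coord=8):
    """Snap each 2D feature pair to the nearest point of a bounded hexagonal grid.

    The nearest point is found by rounding the lattice coordinates and checking
    the 3x3 neighbourhood of candidates; integer coordinates are clipped to
    [-max_coord, max_coord], giving an implicit codebook of (2*max_coord+1)**2.
    """
    basis = HEX_BASIS * step
    coords = pairs @ np.linalg.inv(basis)                 # continuous lattice coordinates
    offsets = np.array([(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)])
    best_pts, best_ids = [], []
    for x, c in zip(pairs, coords):
        cand = np.clip(np.round(c) + offsets, -max_coord, max_coord)
        pts = cand @ basis
        k = np.argmin(((pts - x) ** 2).sum(axis=1))
        best_pts.append(pts[k])
        i, j = cand[k].astype(int) + max_coord            # shift to non-negative indices
        best_ids.append(i * (2 * max_coord + 1) + j)      # implicit codebook index
    return np.array(best_pts), np.array(best_ids)

pairs = np.random.randn(5, 2)
q, ids = quantize_hex(pairs)
print(q.shape, ids)
```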
The experimental evaluation is comprehensive, involving extensive datasets and multiple state-of-the-art (SOTA) models for comparison. The results demonstrate that Q2D2 achieves competitive to superior performance in reconstruction quality across various metrics, including UTMOS, PESQ, and STOI. The paper includes ablation studies that effectively highlight the impact of design choices, such as grid type and quantization levels, on performance. The thoroughness of the experiments lends credibility to the claims made regarding the advantages of Q2D2.
The paper provides detailed implementation and experimental setup information, which is crucial for reproducibility. However, the absence of a dedicated project or code repository limits the ability of others to fully replicate the results. The authors mention using the WavTokenizer framework, which is a positive aspect, as it allows for some level of reproducibility if the framework is accessible.
One limitation is the lack of a clear primary institution and the absence of a demo or project URL, which could enhance the visibility and accessibility of the research. Additionally, while the paper focuses on speech reconstruction, the generalizability of Q2D2 to other audio domains remains to be explored in future work.
The introduction of Q2D2 has the potential to significantly impact the field of audio processing, particularly in applications requiring efficient audio compression without sacrificing quality. Its implications extend to areas such as speech synthesis, music generation, and multimodal systems that integrate audio with other modalities.