Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
Primary: UC Berkeley
All Institutions: UC Berkeley
StyleStream introduces the first real-time zero-shot voice style conversion system capable of modifying timbre, accent, and emotion with an end-to-end latency of approximately 1 second. This paper represents a significant technical contribution to the field, addressing key challenges in voice style conversion through innovative methodology and rigorous experimental validation, thus paving the way for practical applications in various domains.
The methodology presented in StyleStream is innovative, combining a Destylizer for content-style disentanglement with a Stylizer based on a diffusion transformer. The use of an ASR loss and a compact finite scalar quantization (FSQ) bottleneck is a significant advance over previous methods, enabling cleaner disentanglement of linguistic content from style attributes. The fully non-autoregressive architecture enables real-time processing, a notable improvement for voice style conversion. The paper provides a comprehensive description of the architecture, training procedures, and the rationale behind design choices, demonstrating a solid understanding of the challenges involved.
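To illustrate the kind of bottleneck the review refers to, here is a minimal numpy sketch of finite scalar quantization. The function name `fsq_quantize` and the 5-level setting are illustrative, not taken from the paper; the idea is that each latent dimension is bounded and snapped to a small fixed set of values, so the codebook is implicit and very low-capacity, which is what squeezes style information out of the content pathway.

```python
import numpy as np

def fsq_quantize(z, levels=5):
    # Bound each dimension to (-1, 1), then snap it to `levels` evenly
    # spaced values -- the entire "codebook" is implicit in the grid.
    bounded = np.tanh(z)
    half = (levels - 1) / 2.0
    return np.round(bounded * half) / half

# Hypothetical 4-dimensional content latent for one frame.
z = np.array([[0.3, -2.1, 0.9, 1.7]])
codes = fsq_quantize(z, levels=5)
```

With 5 levels per dimension, every frame is forced onto one of only 5^4 possible codes, a far tighter constraint than a continuous latent.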
The experimental evaluation is thorough, utilizing a diverse dataset of 50k hours of English speech for training and a well-structured test set for evaluation. The results show that StyleStream outperforms existing methods in terms of intelligibility and style fidelity across multiple metrics, including WER and similarity scores. The paper effectively communicates the performance improvements over baseline models, providing both objective and subjective evaluations, which strengthen the claims of superior performance.
The paper includes detailed descriptions of the architecture, training configurations, and evaluation metrics, which facilitate reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers. The authors could enhance reproducibility by providing access to their trained models and detailed implementation instructions.
One limitation is the reliance on a large amount of training data, which may not be readily available for all researchers. Additionally, while the paper claims real-time processing capabilities, the actual latency may vary depending on hardware, which could limit practical applications in certain environments. The model's performance with shorter reference utterances is also noted to degrade, indicating a potential limitation in flexibility.
The advancements presented in StyleStream have significant implications for applications in voice synthesis, dubbing, and personalized voice assistants, where real-time style conversion can enhance user experience. The ability to modify timbre, accent, and emotion in real-time opens up new avenues for interactive applications in entertainment, education, and accessibility, potentially impacting how voice technologies are integrated into daily life.
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Prevailing approaches treat it as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a new, dynamic framework for envelope reconstruction that leverages this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance these cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a powerful synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple mapping, but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Laboratory for Computational Audio Perception
The main contribution of this work is the introduction of the DECAF model, which reframes speech envelope reconstruction from EEG as a dynamic state-estimation problem, significantly improving reconstruction accuracy through the integration of temporal context. This innovative approach not only advances the state-of-the-art in auditory attention decoding but also opens new avenues for research in brain-computer interfaces and neurotechnology.
The proposed DECAF model introduces a novel dynamic framework for reconstructing speech envelopes from EEG data by integrating a predictive temporal prior with direct neural estimates. This approach is innovative as it shifts the paradigm from static regression to dynamic state estimation, leveraging temporal dependencies inherent in speech signals. The architecture is modular, consisting of three core components: the EEG-to-Envelope decoder, the Envelope Forecaster, and the Dynamic Fusion gate, which work together to enhance reconstruction accuracy. The use of a learned gating mechanism to balance the contributions of neural evidence and temporal context is particularly noteworthy, as it allows for adaptive integration of information.
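As a rough illustration of the fusion idea (not the authors' implementation), the gate can be read as a per-sample convex combination of the two estimates. The weights `w` and bias `b` below are placeholders for whatever the learned gate actually computes from its inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(eeg_est, forecast, w, b):
    # Gate in (0, 1): decides, per time step, how much to trust the
    # direct neural estimate versus the temporal-context prediction.
    g = sigmoid(w[0] * eeg_est + w[1] * forecast + b)
    return g * eeg_est + (1.0 - g) * forecast

eeg_est = np.array([0.2, 0.8, 0.1])   # decoded from the current EEG window
forecast = np.array([0.3, 0.3, 0.3])  # predicted from recent envelope history
fused = gated_fusion(eeg_est, forecast, w=(2.0, -2.0), b=0.0)
```

Because the gate stays strictly between 0 and 1, the fused output always lies between the two source estimates at each time step, which is what makes the adaptive balancing well behaved.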
The authors validate their model using the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static EEG-only baselines and achieving state-of-the-art performance. The experiments are well-structured, comparing DECAF against established baselines, including traditional methods and contemporary deep learning architectures. The results are quantitatively supported by statistical significance tests, and the ablation studies provide insights into the contributions of each component of the model.
The paper provides sufficient details regarding the dataset, experimental setup, and model training procedures, which enhances reproducibility. The authors adhere to established protocols and publicly share their code repository, facilitating further experimentation and validation by other researchers.
While the model shows promise, it may face challenges in real-world applications where EEG signals are subject to high noise levels. The performance of DECAF under extreme noise conditions aligns with baseline models, indicating a potential limitation in robustness. Additionally, the reliance on past predictions may introduce biases if the initial estimates are inaccurate.
The implications of this research extend to neuro-steered hearing aids and auditory attention decoding systems, potentially improving the quality of life for individuals with hearing impairments. By enhancing the accuracy of speech envelope reconstruction, the DECAF framework could lead to more effective auditory processing technologies, making it a significant contribution to both machine learning and assistive technologies.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects large language models and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimal look-ahead size for each input token, the proposed model can consider future context for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
Primary: LY Corporation
All Institutions: LY Corporation
The paper presents CC-G2PnP, a novel streaming model for grapheme-to-phoneme and prosody conversion that addresses the challenges of unsegmented languages. Its innovative methodology and robust experimental results position it as a significant contribution to the field of audio processing and speech synthesis.
The proposed CC-G2PnP model employs a Conformer-CTC architecture that innovatively processes grapheme tokens in chunks, allowing for streaming inference of phonemic and prosodic labels. The introduction of minimum look-ahead (MLA) is a significant methodological advancement, as it addresses the limitations of previous streaming models that rely on explicit word boundaries. This approach is particularly beneficial for unsegmented languages like Japanese, where word boundaries are not clearly defined. The integration of self-conditioned CTC into the architecture further enhances the model's performance by allowing dynamic learning of alignments between graphemes and phonemes.
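A toy sketch of the chunked streaming scheme with a guaranteed look-ahead. The helper `stream_chunks` and the sizes are illustrative; the actual model operates on grapheme token embeddings inside a Conformer, but the scheduling logic is the same: each chunk may attend to a small fixed number of future tokens beyond its boundary.

```python
def stream_chunks(tokens, chunk_size=4, look_ahead=2):
    # Return (chunk, visible_context) pairs: each emitted chunk sees the
    # full history plus `look_ahead` tokens beyond its own boundary, so
    # every token has at least some right context before decoding.
    steps = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        visible = tokens[:start + chunk_size + look_ahead]
        steps.append((chunk, visible))
    return steps

steps = stream_chunks(list(range(10)), chunk_size=4, look_ahead=2)
```

The practical consequence is that output latency is bounded by `chunk_size + look_ahead` tokens rather than by the full utterance length, which is what makes streaming inference possible.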
The experiments conducted on a Japanese dataset demonstrate the effectiveness of CC-G2PnP, showing significant improvements in character error rate (CER) and sentence error rate (SER) compared to baseline models. The use of both objective metrics and subjective assessments of TTS naturalness provides a comprehensive evaluation of the model's performance. The dataset preparation and experimental conditions are well-documented, allowing for a clear understanding of the model's capabilities and limitations.
While the paper provides detailed descriptions of the model architecture and training procedures, the lack of a publicly available code repository or demo URL limits reproducibility. The absence of specific hyperparameters and training configurations in a readily accessible format could hinder other researchers from replicating the results.
One limitation noted is the reliance on a large amount of training data to achieve optimal performance, which may not be feasible for all applications. Additionally, while the model performs well in terms of accuracy, the subjective evaluation of TTS naturalness could vary based on the speaker used during testing, which may not generalize across different voices.
The CC-G2PnP model has the potential to significantly enhance text-to-speech systems, particularly for languages without explicit word boundaries. This could lead to more natural and efficient human-machine interactions in various applications, including virtual assistants, language learning tools, and accessibility technologies for the visually impaired. The advancements in streaming G2PnP could also inspire further research in related areas, such as real-time speech synthesis and multilingual processing.
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through melody-conditioned text-to-music models, the task of cover song generation remains largely unaddressed. In this work, we reformulate cover song generation as a conditional generation task that simultaneously generates new vocals and accompaniment conditioned on the original vocal melody and text prompts. To this end, we present SongEcho, which leverages Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), a framework that incorporates controllable generation by improving both the conditioning injection mechanism and the conditional representation. To enhance the conditioning injection mechanism, we extend Feature-wise Linear Modulation (FiLM) to Element-wise Linear Modulation (EiLM) to facilitate precise temporal alignment in melody control. For conditional representations, we propose Instance-Adaptive Condition Refinement (IACR), which refines conditioning features by interacting with the hidden states of the generative model, yielding instance-adaptive conditioning. Additionally, to address the scarcity of large-scale, open-source full-song datasets, we construct Suno70k, a high-quality AI song dataset enriched with comprehensive annotations. Experimental results across multiple datasets demonstrate that our approach generates superior cover songs compared to existing methods, while requiring fewer than 30% of the trainable parameters. The code, dataset, and demos are available at https://github.com/lsfhuihuiff/SongEcho_ICLR2026.
Primary: National Natural Science Foundation of China
All Institutions: National Natural Science Foundation of China, China Scholarship Council, German Research Foundation, National Science and Technology Council, Taiwan
The paper presents a novel approach to cover song generation through the introduction of IA-EiLM and IACR, significantly advancing the field of audio machine learning. The methodology and experimental results indicate a strong potential for practical applications in music generation, although further work is needed to enhance reproducibility and address limitations.
The methodology presented in the paper is innovative, particularly with the introduction of Instance-Adaptive Element-wise Linear Modulation (IA-EiLM) and Instance-Adaptive Condition Refinement (IACR). The extension of Feature-wise Linear Modulation (FiLM) to EiLM is a notable advancement that addresses the challenge of temporal alignment in melody control. The dual focus on generating both vocals and accompaniment conditioned on the original melody and text prompts is a significant step forward in cover song generation. However, the paper could benefit from a more detailed explanation of the underlying mechanics of IA-EiLM and IACR, particularly how they interact with the generative model's hidden states.
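The FiLM-to-EiLM distinction the review highlights can be seen directly in the shapes of the modulation parameters. This is a schematic numpy sketch under assumed shapes, not the paper's implementation: FiLM shares one scale/shift pair per channel across all frames, while element-wise modulation carries a separate pair per frame and channel.

```python
import numpy as np

T, C = 6, 4                                           # time steps, channels (assumed)
h = np.random.default_rng(0).standard_normal((T, C))  # hidden states

# FiLM: one (gamma, beta) per channel, broadcast over every time step,
# so the same modulation is applied at all frames.
gamma_f, beta_f = np.full(C, 2.0), np.full(C, 0.5)
film_out = gamma_f * h + beta_f

# EiLM (element-wise): a separate (gamma, beta) per frame AND channel,
# so the melody condition can modulate each time step independently.
gamma_e, beta_e = np.full((T, C), 2.0), np.full((T, C), 0.5)
eilm_out = gamma_e * h + beta_e
```

With constant parameters, as here, the two coincide; the point is that a time-varying `gamma_e` can realize frame-accurate melody alignment that a per-channel FiLM cannot express.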
The experimental results are compelling, demonstrating that SongEcho outperforms existing methods while utilizing fewer parameters, which suggests a more efficient model. The construction of the Suno70k dataset is a valuable contribution, addressing a critical gap in the availability of high-quality, annotated datasets for song generation tasks. However, the paper should provide more comprehensive comparisons with a wider range of baseline methods to strengthen the claims of superiority.
The authors have made the code and dataset publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation instructions and hyperparameter settings, which could hinder other researchers from replicating the results accurately.
One limitation of the study is the potential overfitting due to the small size of the dataset relative to the complexity of the task. Additionally, the subjective nature of music generation means that quantitative metrics may not fully capture the quality of the generated songs. The paper could also explore the limitations of the IA-EiLM and IACR methods in different musical contexts or genres.
The proposed framework has significant implications for the music industry, particularly in automated music composition and cover song generation. By enabling more nuanced and emotionally resonant reinterpretations of existing songs, this research could enhance creative processes in music production. Furthermore, the availability of the Suno70k dataset could spur further research in music generation and related fields.
Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional classification models. To address this, we propose a Longitudinal Intra-Patient Tracking (LIPT) scheme designed to capture the trajectory of relative symptomatic changes within individuals. Central to this framework is a Personalised Sequential Encoder (PSE), which transforms longitudinal speech recordings into context-aware latent representations. By incorporating historical data at each timestamp, the PSE facilitates a holistic assessment of the clinical trajectory rather than modelling discrete visits independently. Experimental results from a cohort of 225 patients demonstrate that the LIPT paradigm significantly outperforms the classic cross-sectional approaches, achieving a recognition accuracy of 99.7% for clinical status transitions. The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings. Furthermore, this work addresses the gap in existing literature by providing a comprehensive analysis of different speech task designs and acoustic features. Taken together, the superior performance of the LIPT framework and PSE architecture validates their readiness for integration into long-term telemonitoring systems, offering a scalable solution for remote heart failure management.
Primary: Taizhou People’s Hospital
All Institutions: Taizhou People’s Hospital, Jiangsu, China
The main contribution of this paper is the development of a personalized speech-based monitoring system for heart failure that utilizes a longitudinal approach to track individual patient trajectories. This innovative methodology and its promising results highlight the potential for speech dynamics to serve as effective biomarkers in the management of chronic health conditions.
The proposed methodology introduces a novel Longitudinal Intra-Patient Tracking (LIPT) framework that leverages a Personalised Sequential Encoder (PSE) to model heart failure (HF) progression through speech dynamics. This approach is innovative as it shifts the focus from traditional cross-sectional models to a longitudinal perspective, capturing individual patient trajectories over time. The methodology is well-structured, with clear stages for feature extraction, statistical screening, and longitudinal tracking, which collectively enhance the model's ability to account for inter-individual variability in speech characteristics. The integration of both global and frame-level features, particularly the emphasis on RASTA features, demonstrates a thoughtful approach to feature selection that aligns with the clinical context of HF monitoring.
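To make the longitudinal idea concrete, here is a deliberately simplified stand-in for a sequential encoder: an exponential moving average over visit features. The real PSE is a learned network; the recurrence, the `alpha` smoothing factor, and the two-dimensional features are purely illustrative of how each timestamp's representation comes to reflect the whole trajectory rather than one isolated visit.

```python
import numpy as np

def personalised_sequential_encode(visits, alpha=0.5):
    # Toy sequential encoder: blend each visit's feature vector with the
    # running patient state, so the representation at time t carries
    # information from all earlier visits, not just the current one.
    state = np.zeros_like(visits[0])
    states = []
    for x in visits:
        state = alpha * state + (1 - alpha) * x
        states.append(state.copy())
    return states

visits = [np.array([1.0, 0.0]), np.array([1.0, 2.0]), np.array([3.0, 2.0])]
traj = personalised_sequential_encode(visits)
```

A classifier reading `traj` sees relative change within the individual, which is exactly the inter-patient heterogeneity problem the longitudinal framing is meant to sidestep.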
The experimental evaluation is robust, utilizing a substantial cohort of 225 patients and multiple speech tasks to assess the model's performance. The results indicate a significant improvement in classification accuracy (99.7%) compared to traditional methods, underscoring the effectiveness of the LIPT framework. The use of follow-up data to validate model performance further strengthens the findings, although the paper could benefit from additional comparative analyses with more diverse datasets to enhance generalizability.
The paper provides a comprehensive description of the data collection process, feature extraction methods, and model architecture, which are essential for reproducibility. However, the lack of detailed hyperparameter settings and training procedures may pose challenges for other researchers attempting to replicate the results. The availability of code and models on GitHub is a positive aspect that facilitates reproducibility.
One notable limitation is the potential for high false-positive rates in identifying stable patients, which could affect clinical applicability. The model's reliance on specific speech tasks may also limit its generalizability across different populations or settings. Additionally, the study's focus on a single institution may restrict the diversity of the patient cohort, impacting the external validity of the findings.
This research has significant implications for remote patient monitoring in heart failure management, particularly in resource-limited settings. The ability to accurately track HF status through non-invasive speech analysis could enhance patient safety and reduce healthcare costs. The findings may pave the way for integrating such technologies into routine clinical practice, thereby improving patient outcomes and access to care.
Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events. However, practical composition and performance workflows frequently rely on resource-limited devices (e.g., electronic instruments and portable computers), making heavy memory and attention computation difficult to deploy. We introduce Depth-Structured Music Recurrence (DSMR), a recurrent long-context Transformer for full-piece symbolic music modeling that extends context beyond fixed-length excerpts via segment-level recurrence with detached cross-segment states, featuring a layer-wise memory-horizon schedule that budgets recurrent KV states across depth. DSMR is trained in a single left-to-right pass over each complete composition, akin to how a musician experiences it from beginning to end, while carrying recurrent cross-segment states forward. Within this recurrent framework, we systematically study how depth-wise horizon allocations affect optimization, best-checkpoint perplexity, and efficiency. By allocating different history-window lengths across layers while keeping the total recurrent-state budget fixed, DSMR creates depth-dependent temporal receptive fields within a recurrent attention stack without reducing compute depth. Our main instantiation is a two-scale DSMR schedule that allocates long history windows to lower layers and a uniform short window to the remaining layers. Experiments on the piano performance dataset MAESTRO demonstrate that two-scale DSMR provides a practical quality--efficiency recipe for full-length long-context symbolic music modeling with recurrent attention under limited computational resources.
Primary: Auckland University of Technology
All Institutions: Auckland University of Technology
The paper presents a novel approach to long-context modeling in symbolic music generation through the Depth-Structured Music Recurrence framework, effectively balancing computational efficiency with the need for extensive contextual information. The comprehensive methodology and rigorous experimental evaluation contribute to its significance in advancing the field of machine learning for music generation.
The proposed Depth-Structured Music Recurrence (DSMR) framework innovatively addresses the challenges of long-context modeling in symbolic music generation by implementing a recurrent long-context Transformer that utilizes segment-level recurrence with detached cross-segment states. The methodology is well-structured, allowing for depth-dependent temporal receptive fields by varying memory horizons across layers. This approach is particularly relevant for resource-constrained environments, as it balances computational efficiency with the need for extensive contextual information in music generation.
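The layer-wise memory-horizon budgeting described here can be sketched in a few lines. This is a hypothetical illustration under assumed settings: the function name `two_scale_schedule`, the layer split, and the budget numbers are illustrative, not the paper's reported configuration.

```python
# Hypothetical sketch of a two-scale memory-horizon schedule: long history
# windows for the lower layers, a uniform short window for the rest, under a
# fixed total recurrent-state budget.

def two_scale_schedule(n_layers, total_budget, n_long_layers, long_horizon):
    """Allocate per-layer recurrent KV history windows under a fixed budget.

    Lower layers get `long_horizon` tokens of detached cross-segment state;
    the remaining budget is spread uniformly over the upper layers.
    """
    if n_long_layers * long_horizon > total_budget:
        raise ValueError("long-layer allocation exceeds total budget")
    remaining = total_budget - n_long_layers * long_horizon
    short_horizon = remaining // (n_layers - n_long_layers)
    return [long_horizon] * n_long_layers + [short_horizon] * (n_layers - n_long_layers)

schedule = two_scale_schedule(n_layers=12, total_budget=12288,
                              n_long_layers=4, long_horizon=2048)
# The total recurrent state stays within the fixed budget.
assert sum(schedule) <= 12288
```

Because the total state budget is held fixed, varying only the per-layer split isolates the effect of depth-dependent temporal receptive fields, which is the ablation axis the paper studies.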
The experiments conducted on the MAESTRO dataset are rigorous and well-documented, demonstrating the effectiveness of the two-scale DSMR model in achieving lower perplexity compared to other methods under similar memory constraints. The evaluation metrics, including perplexity and efficiency (tokens processed per second, peak memory usage), provide a comprehensive view of the model's performance. The comparative analysis with other models, including full-attention references, adds credibility to the findings.
The paper provides sufficient details regarding the experimental setup, model architecture, and training protocols, which enhances reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which other researchers can replicate the results.
The study acknowledges certain limitations, such as the potential impact of the chosen model scale and memory settings on performance. Additionally, the findings may not generalize to all long-context domains, and the exploration of the design space for memory retention and gating could be expanded.
The implications of this research are significant for the field of symbolic music generation, particularly in enabling more efficient models that can operate on consumer-grade hardware. This advancement could facilitate real-time applications in music composition and performance, making sophisticated music generation tools more accessible to creators.
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. The proposed method decouples training into two stages. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.55% on average across all metrics. The resulting 2E1D student model improves on the traditional supervised learning baseline by 3.79% on average and achieves nearly the same performance as the teacher. Both students show large gains on rare chord qualities.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a two-stage training pipeline that leverages pseudo-labeling and knowledge distillation to enhance automatic chord recognition, particularly in scenarios with limited labeled data. This work presents a significant advancement in the field, offering a practical solution to a common challenge in music information retrieval.
The paper proposes a two-stage training pipeline that effectively utilizes pseudo-labeling and knowledge distillation to enhance automatic chord recognition. The methodology is well-structured, with a clear distinction between the two training phases: the first leverages a pre-trained teacher model to generate pseudo-labels from unlabeled audio, while the second phase incorporates ground-truth labels with selective knowledge distillation to mitigate catastrophic forgetting. This approach is innovative in its decoupling of labeled and unlabeled data training, allowing for improved model performance even when labeled data is scarce.
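The stage-2 objective described above (ground-truth cross-entropy plus a selective KD regularizer from the frozen teacher) can be sketched as follows. The confidence threshold and KD weighting used here are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

# Illustrative sketch of a stage-2 loss: cross-entropy on ground-truth chord
# labels plus selective knowledge distillation from the frozen teacher,
# applied only on frames where the teacher is confident.

def stage2_loss(student_logits, teacher_logits, labels,
                kd_weight=0.5, conf_threshold=0.9):
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)

    # Cross-entropy against the ground-truth labels.
    n = len(labels)
    ce = -np.log(p_student[np.arange(n), labels]).mean()

    # Selective KD: only distill frames the teacher is confident about,
    # regularizing the student toward its stage-1 representations.
    mask = p_teacher.max(axis=-1) >= conf_threshold
    if mask.any():
        kl = (p_teacher[mask] * np.log(p_teacher[mask] / p_student[mask])).sum(axis=-1).mean()
    else:
        kl = 0.0
    return ce + kd_weight * kl
```

The confidence gate is one plausible way to make the distillation "selective"; gating on agreement with the ground truth, or on per-class teacher reliability, would be equally consistent with the description.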
The experiments are comprehensive, utilizing over 1,000 hours of unlabeled audio across various datasets. The results demonstrate significant improvements in performance metrics, particularly for rare chord qualities, which is a critical aspect of chord recognition. The use of standard mir_eval metrics adds rigor to the evaluation, and the comparative analysis against traditional supervised learning baselines highlights the effectiveness of the proposed method. However, the paper could benefit from more detailed ablation studies to further validate the impact of each component in the training pipeline.
The paper provides sufficient detail regarding the training configurations, datasets, and evaluation metrics, which supports reproducibility. However, the absence of a clear description of the experimental setup and hyperparameter tuning could pose challenges for other researchers attempting to replicate the results. The provided GitHub link to the project may aid in this regard, assuming it contains the necessary code and documentation.
One limitation identified is the reliance on the quality of the teacher model for generating pseudo-labels. If the teacher model is biased or poorly generalizable, it could negatively impact the performance of the student model. Additionally, the paper does not address potential issues related to the scalability of the method when applied to larger datasets or more complex chord recognition tasks.
The proposed methodology has significant implications for the field of music information retrieval, particularly in enhancing automatic chord recognition capabilities. By effectively utilizing unlabeled data, this approach could lower the barriers to developing robust ACR systems, making them more accessible for various applications in music analysis, education, and automated music transcription. The focus on improving recognition of rare chord qualities also addresses a critical gap in existing ACR systems.
Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generating source L2 speech and using authentic native speech as the training target, our approach avoids learning from TTS artifacts and, crucially, requires no real L2 data in training. Alongside this data strategy, we introduce CosyAccent, a non-autoregressive model that resolves the trade-off between prosodic naturalness and duration control. CosyAccent implicitly models rhythm for flexibility yet offers explicit control over total output duration. Experiments show that, despite being trained without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.
Primary: Shenzhen Research Institute of Big Data
All Institutions: Shenzhen Research Institute of Big Data
The paper presents CosyAccent, a novel duration-controllable accent normalization model that utilizes a unique source-synthesis training data strategy to improve the naturalness and content preservation of accent conversion systems. This work represents a meaningful advancement in the field, addressing critical challenges in accent normalization and offering a scalable solution for future research and applications.
The paper introduces a novel "source-synthesis" methodology for constructing training data, which synthesizes L2 source speech from a high-quality L1 corpus, thereby avoiding TTS artifacts. This innovative approach allows the model to be trained without real L2 data, which is a significant advancement in the field of accent normalization. The CosyAccent model itself is a non-autoregressive architecture that effectively balances prosodic naturalness and explicit duration control, addressing a critical limitation in existing models.
The experiments are well-structured, comparing CosyAccent against strong baselines trained on real L2 data. The results demonstrate significant improvements in content preservation and naturalness, validated through both subjective and objective metrics. The use of a diverse dataset covering multiple accents adds robustness to the evaluation, although the paper could benefit from a more extensive discussion on the statistical significance of the results.
The paper provides adequate details regarding the model architecture, training data construction, and evaluation metrics, which supports reproducibility. The inclusion of a GitHub repository for code and data further enhances the potential for other researchers to replicate the findings.
The primary limitation noted is the model's robustness to acoustic noise and control over paralinguistic features, as the synthetic data used for training is very clean. Additionally, the reliance on a specific TTS model for data generation may limit the generalizability of the approach to other languages or accents.
The implications of this research are significant, particularly for applications in language learning, dubbing, and personalized TTS systems. By reducing the dependency on real L2 data, the proposed method could facilitate the development of accent normalization systems in resource-scarce languages, potentially broadening access to language education and media.
In sequence-to-sequence Transformer ASR, autoregressive (AR) models achieve strong accuracy but suffer from slow decoding, while non-autoregressive (NAR) models enable parallel decoding at the cost of degraded performance. We propose a principled NAR ASR framework based on Masked Diffusion Models to reduce this gap. A pre-trained speech encoder is coupled with a Transformer diffusion decoder conditioned on acoustic features and partially masked transcripts for parallel token prediction. To mitigate the training-inference mismatch, we introduce Iterative Self-Correction Training that exposes the model to its own intermediate predictions. We also design a Position-Biased Entropy-Bounded Confidence-based sampler with positional bias to further boost results. Experiments across multiple benchmarks demonstrate consistent gains over prior NAR models and competitive performance with strong AR baselines, while retaining parallel decoding efficiency.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology, Università degli Studi di Palermo
The paper presents MDM-ASR, a novel approach that leverages masked diffusion models for efficient and accurate non-autoregressive automatic speech recognition. The integration of innovative methodologies and comprehensive experimental validation positions this work as a meaningful contribution to the field of machine learning and speech processing.
The proposed MDM-ASR framework innovatively integrates masked diffusion models into the ASR domain, addressing the limitations of autoregressive and traditional non-autoregressive models. The use of Iterative Self-Correction Training (ISCT) to align training with inference is a significant methodological advancement, as it allows the model to learn from its own predictions, thereby enhancing robustness. The introduction of Position-Biased Entropy-Bounded Confidence-based samplers further refines the decoding process, showcasing a well-thought-out approach to improving efficiency and accuracy.
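One refinement iteration of a confidence-based parallel decoder with positional bias, loosely in the spirit of the sampler described above, can be sketched as follows. The scoring rule, the form of the positional bias, and the fallback step are illustrative assumptions, not the paper's exact sampler.

```python
import numpy as np

# Hedged sketch of one step of entropy-bounded, confidence-based unmasking
# with a bias toward earlier positions, for a masked-diffusion ASR decoder.

def decode_step(tokens, probs, is_masked, pos_bias_scale=0.1, entropy_bound=1.0):
    """Commit masked positions whose predictive distribution is low-entropy.

    tokens:    (T,) current token ids, -1 at masked positions.
    probs:     (T, V) per-position distributions from the diffusion decoder.
    is_masked: (T,) boolean mask of still-masked positions.
    """
    T = probs.shape[0]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    # Bias confidence toward earlier positions (a left-to-right tendency).
    confidence = probs.max(axis=-1) + pos_bias_scale * (1.0 - np.arange(T) / T)
    commit = is_masked & (entropy <= entropy_bound)
    if not commit.any():
        # Guarantee progress: commit the single most confident masked position.
        masked_idx = np.where(is_masked)[0]
        commit[masked_idx[np.argmax(confidence[masked_idx])]] = True
    tokens = np.where(commit, probs.argmax(axis=-1), tokens)
    return tokens, is_masked & ~commit
```

Iterating this step until nothing remains masked yields a fully parallel, non-autoregressive decode; the entropy bound trades off iterations against the risk of committing uncertain tokens too early.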
The experiments are comprehensive, covering multiple English and multilingual datasets, and the results demonstrate that MDM-ASR outperforms existing models in both accuracy and decoding efficiency. The ablation studies provide valuable insights into the contributions of various components, reinforcing the robustness of the findings. However, the reliance on specific datasets may limit the generalizability of the results.
The paper provides sufficient details regarding the experimental setup, including model architecture and training procedures, which enhances reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
The paper acknowledges limitations in terms of dataset diversity and the need for further exploration of alternative model configurations. Additionally, the evaluation is primarily based on benchmark datasets, which may not fully capture real-world performance across varied conditions.
The advancements in ASR technology presented in this paper have significant implications for real-time applications, such as virtual assistants and transcription services, where efficiency and accuracy are paramount. The proposed methods could pave the way for more scalable and effective ASR systems across different languages and domains.
This paper highlights the critical importance of multi-channel speech enhancement (MCSE) for speech emotion recognition (ER) in cocktail party scenarios. A multi-channel speech dereverberation and separation front-end integrating DNN-WPE and mask-based MVDR is used to extract the target speaker's speech from the mixture before it is fed into the downstream ER back-end, which uses HuBERT- and ViT-based speech and visual features. Experiments on mixture speech constructed from the IEMOCAP and MSP-FACE datasets suggest the MCSE output consistently outperforms domain fine-tuned single-channel speech representations produced by: a) Conformer-based metric GANs; and b) WavLM SSL features with optional SE-ER dual-task fine-tuning. Statistically significant increases in weighted accuracy, unweighted accuracy, and F1 measures of up to 9.5%, 8.5% and 9.1% absolute (17.1%, 14.7% and 16.0% relative) are obtained over the above single-channel baselines. The generalization of IEMOCAP-trained MCSE front-ends is also demonstrated via zero-shot application to out-of-domain MSP-FACE data.
Primary: Institute of Software, Chinese Academy of Sciences
All Institutions: Institute of Software, Chinese Academy of Sciences, National Research Council Canada, The Chinese University of Hong Kong
This paper makes a significant contribution by demonstrating the effectiveness of multi-channel speech enhancement techniques for improving emotion recognition in challenging acoustic environments. The innovative methodology and strong experimental results highlight its potential impact on future research and applications in the field.
The paper presents a robust methodology that integrates multi-channel speech enhancement (MCSE) with emotion recognition (ER) in cocktail party scenarios. The use of a DNN-WPE based dereverberation and mask-based MVDR separation front-end is innovative, particularly in its application to ER, which has traditionally relied on single-channel inputs. The integration of HuBERT and ViT for feature extraction further enhances the approach, making it suitable for both audio-only and audio-visual ER systems. The detailed ablation studies provide insights into the contributions of each component, showcasing a comprehensive understanding of the problem space.
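The mask-based MVDR step referenced above can be sketched per frequency bin as below. This is a schematic numpy version under assumed shapes, with the diagonal loading and reference-channel formulation as illustrative choices; the DNN-WPE dereverberation stage that precedes it in the paper is omitted.

```python
import numpy as np

# Schematic mask-based MVDR beamformer for a single frequency bin.
# X: (C, T) complex STFT vectors across C microphones and T frames;
# speech_mask, noise_mask: (T,) TF-mask values for this bin.

def mask_mvdr(X, speech_mask, noise_mask, ref=0):
    # Mask-weighted spatial covariance estimates for speech and noise.
    Phi_s = (speech_mask * X) @ X.conj().T / speech_mask.sum()
    Phi_n = (noise_mask * X) @ X.conj().T / noise_mask.sum()
    Phi_n = Phi_n + 1e-6 * np.eye(X.shape[0])   # diagonal loading for stability
    num = np.linalg.solve(Phi_n, Phi_s)          # Phi_n^{-1} Phi_s
    w = num[:, ref] / np.trace(num)              # reference-channel MVDR weights
    return w.conj() @ X                          # enhanced single-channel output
```

Running this over all bins produces the enhanced target-speaker spectrogram that is then passed to the HuBERT/ViT-based ER back-end.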
The experiments are well-structured, utilizing two established datasets (IEMOCAP and MSP-FACE) to evaluate the proposed system's performance. The results demonstrate statistically significant improvements in accuracy and F1 scores compared to single-channel baselines, indicating the effectiveness of the MCSE approach. The zero-shot application of the MCSE front-end to out-of-domain data is particularly noteworthy, suggesting good generalization capabilities.
The paper provides sufficient details regarding the experimental setup, including model configurations and training strategies, which enhances reproducibility. However, the absence of a publicly available code repository may hinder full reproducibility for other researchers.
While the paper addresses a significant gap in the literature, it does not explore the potential computational costs and real-time applicability of the proposed MCSE front-end in practical scenarios. Additionally, the reliance on simulated data for training may limit the model's performance in real-world applications.
The findings of this research have the potential to significantly advance the field of emotion recognition in noisy environments, particularly in applications such as human-computer interaction, assistive technologies, and surveillance systems. The integration of multi-channel processing could lead to more robust systems capable of understanding human emotions in complex auditory scenes.
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic .
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the demonstration that self-supervised speech models encode phonologically interpretable and compositional vectors, revealing a structured representation of phonological features. This work significantly advances the understanding of S3M representations and their potential applications in speech technology and linguistics.
The paper presents a novel approach to understanding self-supervised speech models (S3Ms) by investigating the linear structure of phonological features within their representation space. The methodology involves analyzing phonological vectors across 96 languages, establishing a framework for phonological vector arithmetic. The use of cosine similarity to evaluate phonological analogies and the introduction of a vocoder to assess the scaling of phonological vectors are innovative aspects that enhance the understanding of S3M representations.
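The vector-arithmetic claim can be illustrated on toy data: a voicing direction estimated as mean([d]) - mean([t]) is added to mean([p]) and should land nearest mean([b]) under cosine similarity. The embeddings below are synthetic stand-ins for S3M phone representations, not the paper's actual features.

```python
import numpy as np

# Toy illustration of phonological vector arithmetic: [p] + ([d] - [t]) ~ [b].
rng = np.random.default_rng(0)
place = {"t": [1, 0], "d": [1, 0], "p": [0, 1], "b": [0, 1]}
voicing = {"t": 0, "d": 1, "p": 0, "b": 1}

def embed(ph):
    # Place-of-articulation dims plus a voicing dim, with small noise.
    return np.array(place[ph] + [voicing[ph]], float) + 0.01 * rng.standard_normal(3)

means = {ph: np.mean([embed(ph) for _ in range(50)], axis=0) for ph in place}
voicing_vec = means["d"] - means["t"]   # [d] - [t] isolates the voicing feature
query = means["p"] + voicing_vec        # should be closest to [b]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

nearest = max(means, key=lambda ph: cos(query, means[ph]))
```

The paper's stronger finding is the continuous version of this: scaling `voicing_vec` by fractional amounts traces out a voicing continuum in the synthesized audio.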
The experiments are well-structured, utilizing two datasets (TIMIT and VoxAngeles) to validate the hypotheses regarding phonological vector arithmetic and scaling. The results demonstrate a strong correlation between the scale of phonological vectors and acoustic measurements, providing empirical support for the proposed theories. The analysis of phonological features across different languages adds to the robustness of the findings, although the paper could benefit from a broader range of S3Ms to validate the generalizability of the results.
The authors have made their code and interactive demos publicly available, which is a positive aspect for reproducibility. However, the paper could improve by providing more detailed implementation specifics, particularly regarding the training of the vocoder and the exact configurations used for the S3Ms.
The study is limited by its focus on a specific set of phonological features as defined by PanPhon, which may not capture the full complexity of phonological systems across all languages. Additionally, the results are influenced by the choice of vocoder, and the authors acknowledge that different vocoders may yield varying synthesis results. The paper also notes that it does not explore all possible S3Ms, which could limit the generalizability of the findings.
The findings have significant implications for both speech processing and linguistic theory. By demonstrating that S3Ms can learn interpretable phonological structures, the research opens avenues for more intuitive speech synthesis and understanding of phonological features as continuous rather than binary. This could enhance applications in speech recognition, synthesis, and language learning technologies.
Voice-based digital biomarkers can enable scalable, non-invasive screening and monitoring of Parkinson's disease (PD) and Amyotrophic Lateral Sclerosis (ALS). However, models trained on one cohort or device often fail on new acquisition settings due to cross-device and cross-cohort domain shift. This challenge is amplified in real-world scenarios with partial-label mismatch, where datasets may contain different disease labels and only partially overlap in class space. In addition, voice-based models may exploit demographic cues, raising concerns about gender-related unfairness, particularly when deployed across heterogeneous cohorts. To tackle these challenges, we propose a hybrid framework for unified three-class (healthy/PD/ALS) cross-domain voice classification from partially overlapping cohorts. The method combines style-based domain generalization with conditional adversarial alignment tailored to partial-label settings, reducing negative transfer. An additional adversarial gender branch promotes gender-invariant representations. We conduct a comprehensive evaluation across four heterogeneous sustained-vowel datasets, spanning distinct acquisition settings and devices, under both domain generalization and unsupervised domain adaptation protocols. The proposed approach is compared against twelve state-of-the-art machine learning and deep learning methods, and further evaluated through three targeted ablations, providing the first cross-cohort benchmark and end-to-end domain-adaptive framework for unified healthy/PD/ALS voice classification under partial-label mismatch and fairness constraints. Across all experimental settings, our method consistently achieves the best external generalization over the considered evaluation metrics, while maintaining reduced gender disparities. Notably, no competing method shows statistically significant gains in external performance.
Primary: Ecole Polytechnique Federale de Lausanne
All Institutions: Ecole Polytechnique Federale de Lausanne, Università Campus Bio-Medico di Roma, Eustema S.p.A., Umeå University, UniCamillus-Saint Camillus International University of Health Sciences
The paper presents a novel framework for voice classification of Parkinson's and ALS that effectively addresses challenges of domain adaptation and fairness. The comprehensive evaluation and innovative methodology contribute significantly to the fields of machine learning and medical AI.
The proposed FairPDA framework integrates multiple advanced techniques such as style-based domain generalization, conditional adversarial alignment, and adversarial gender debiasing. This hybrid approach is well-structured and addresses the complex problem of partial-label domain adaptation while considering fairness in voice classification tasks. The methodology is innovative in its combination of techniques and is well-justified through the literature review, although it could benefit from clearer explanations of the specific contributions of each component.
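One ingredient of partial-label alignment described above, down-weighting source classes that appear absent from the target cohort to limit negative transfer, can be sketched as follows. The specific weighting rule is an assumption for illustration, not the paper's formulation.

```python
import numpy as np

# Schematic class-weighting for partial-label domain alignment: classes with
# negligible estimated presence in the target cohort get a small floor weight
# so they contribute little to the adversarial alignment term.

def class_weights(target_probs, floor=0.05):
    """Per-class alignment weights from mean predicted target probabilities.

    target_probs: (n_samples, n_classes) softmax outputs on unlabeled
    target-cohort data.
    """
    w = target_probs.mean(axis=0)
    w = np.maximum(w / w.max(), floor)   # normalize; keep a small floor per class
    return w
```

In a conditional adversarial setup, these weights would rescale each sample's contribution to the domain discriminator loss according to its (pseudo-)class.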
The experiments are comprehensive, utilizing four heterogeneous datasets and comparing the proposed method against twelve state-of-the-art approaches. The evaluation metrics are appropriate for the task, including Balanced Accuracy, Matthews Correlation Coefficient, and fairness metrics. The results demonstrate that FairPDA consistently outperforms competing methods, although the absolute performance levels are moderate, indicating the difficulty of the task.
The paper provides sufficient details regarding the methodology, datasets, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which others can replicate the findings.
The study is limited by its focus on binary gender labels for fairness analysis, which restricts the scope of the fairness evaluation. Additionally, the performance metrics indicate that while FairPDA outperforms competitors, the overall accuracy remains moderate, suggesting that further improvements are needed for practical deployment.
The research has significant implications for the field of medical AI, particularly in the development of voice-based diagnostic tools for neurodegenerative diseases. By addressing fairness and domain adaptation, this work contributes to the ethical deployment of AI in healthcare, potentially leading to better patient outcomes across diverse populations.
In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/meanvoiceflow/.
Primary: NTT Corporation
All Institutions: NTT Corporation
The paper presents MeanVoiceFlow, a novel one-step nonparallel voice conversion model that significantly enhances conversion speed and efficiency. The technical contributions, particularly in addressing training stability and maintaining consistency between training and inference, are well-founded and have the potential to influence future work in voice conversion and related audio applications.
The proposed MeanVoiceFlow model introduces a novel approach to voice conversion by utilizing mean flows instead of traditional instantaneous velocities, which significantly enhances the speed and efficiency of the conversion process. The introduction of a structural margin reconstruction loss addresses training instability, while the conditional diffused-input training method effectively bridges the gap between training and inference, ensuring consistency in performance. The methodology is well-structured, with clear theoretical foundations and practical implementations that are rigorously justified.
The experimental validation is thorough, employing a variety of datasets and metrics to assess the model's performance. The results demonstrate that MeanVoiceFlow achieves performance on par with existing multi-step and distillation-based models, showcasing its effectiveness even when trained from scratch. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings, although further details on the statistical significance of the results would enhance the robustness of the claims.
The paper provides sufficient implementation details, including the architecture of the neural networks and the training procedures, which should facilitate reproducibility. However, the absence of code availability or a public repository could hinder independent verification of the results. Including a clear description of the experimental setup and hyperparameters is beneficial, yet a shared codebase would greatly enhance reproducibility.
One limitation of the study is the reliance on specific datasets, which may affect the generalizability of the results to other voice conversion tasks or languages. Additionally, while the model performs well in zero-shot scenarios, its performance in more complex voice conversion tasks involving diverse accents or languages remains to be evaluated. The potential for over-smoothing in outputs due to the structural margin reconstruction loss also warrants further investigation.
The advancements presented in this paper have significant implications for real-time voice conversion applications, such as in virtual assistants, gaming, and entertainment. The ability to convert voices quickly and effectively without extensive pretraining could democratize access to high-quality voice synthesis technologies. Furthermore, the methodologies introduced may inspire future research in related fields, such as speech synthesis and audio processing.
Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.
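The unifying Gaussian-path view can be stated compactly (illustrative notation; the paper's symbols may differ). With clean speech $x_0$, its paired noisy observation $y$, and time $t \in [0, 1]$, flow and diffusion bridge models all prescribe marginals of the form

```latex
p_t(x_t \mid x_0, y) = \mathcal{N}\!\big(x_t;\; \alpha_t x_0 + \beta_t y,\; \sigma_t^2 I\big),
```

where the schedules $(\alpha_t, \beta_t, \sigma_t)$ pick out the specific construction: deterministic linear interpolation ($\alpha_t = 1 - t$, $\beta_t = t$, $\sigma_t \to 0$) recovers flow matching between pairs, while nonzero $\sigma_t$ yields score-based and Schrödinger-bridge variants. Under a data-prediction loss the network estimates $\hat{x}_0 = f_\theta(x_t, t)$, which is why each sampling step resembles one pass of a predictive enhancer, as the paper argues.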
Primary: Nanjing University
All Institutions: Nanjing University
The main contribution of this paper is the introduction of a unified framework for flow and diffusion bridge models in speech enhancement, which enhances performance through innovative methodologies and insights. This work significantly advances the field by bridging generative and predictive modeling approaches, offering a comprehensive solution to challenges in speech enhancement.
The paper presents a unified framework that integrates flow matching and diffusion bridge models for speech enhancement, providing a novel interpretation of these models as Gaussian probability paths. The methodology is robust, combining theoretical insights with practical improvements in network architecture and training strategies. The introduction of a time embedding mechanism and an enhanced loss function demonstrates a thoughtful approach to optimizing performance while reducing complexity.
The experiments are well-structured, utilizing two datasets for denoising and dereverberation tasks. The results show a clear performance advantage over existing baselines, with comprehensive metrics that validate the effectiveness of the proposed model. The ablation studies further strengthen the findings by isolating the impact of various modifications.
The paper includes sufficient implementation details, including dataset descriptions, training configurations, and hyperparameter settings, which enhance reproducibility. The availability of code on GitHub supports this aspect, allowing other researchers to replicate the experiments.
While the proposed model shows significant improvements, the authors acknowledge that its inherently predictive nature may impose an upper limit on performance compared to purely predictive models. Additionally, the reliance on specific architectures may limit generalizability to other tasks or domains.
The research has potential applications in various speech processing tasks, including real-time communication systems, hearing aids, and assistive technologies for the hearing impaired. The integration of predictive paradigms into generative models could inspire further innovations in speech enhancement and related fields.
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as audio/text embeddings. For model training, we devise a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV's training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLM as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.
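The abstract does not give the exact form of the Hybrid-NCE loss; as a rough illustration of the two ingredients it names (multi-granular supervision and hard-negative reweighting), a symmetric InfoNCE-style loss that averages over caption granularities and adds a margin to the hardest negative might look like the following. All names and values here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def hybrid_nce(audio_emb, text_embs, tau=0.07, hard_neg_margin=0.5):
    """Illustrative contrastive loss. `text_embs` is a list of embedding
    matrices, one per caption granularity (long description, tags, ...).
    The hardest negative per row gets a logit margin so it is penalized
    more strongly. Sketch only, not the paper's Hybrid-NCE."""
    a = F.normalize(audio_emb, dim=-1)
    losses = []
    for t in text_embs:  # one granularity level at a time
        t = F.normalize(t, dim=-1)
        logits = a @ t.T / tau                    # (B, B) similarity matrix
        labels = torch.arange(a.size(0))
        with torch.no_grad():
            masked = logits.clone()
            masked[labels, labels] = float("-inf")  # hide positives
            hard = masked.argmax(dim=1)             # hardest negative per row
        logits = logits.clone()
        logits[labels, hard] += hard_neg_margin     # emphasize hard negatives
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()
```

A margin on the hard negative's logit raises its softmax weight, so the model is pushed harder to separate confusable pairs without changing the loss for easy negatives.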
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AuroLA, a novel framework that effectively utilizes Multimodal Large Language Models for audio-text retrieval, demonstrating significant improvements over existing methods. The comprehensive analysis of the technical contributions, innovative methodology, and promising experimental results highlight its potential impact on the field of machine learning and audio processing.
The proposed AuroLA framework introduces a novel approach to audio-text retrieval by leveraging Multimodal Large Language Models (MLLMs) as a unified backbone. The methodology is well-structured, with a focus on creating a scalable data pipeline and a Hybrid-NCE loss that enhances the alignment of audio and text embeddings through multi-granular supervision. The adaptation of MLLMs for retrieval tasks is innovative, particularly the use of a special token's hidden state for embeddings. However, the paper could benefit from a more detailed explanation of the implementation of the Hybrid-NCE loss and its advantages over traditional contrastive losses.
The experiments conducted are extensive, demonstrating the superiority of AuroLA over existing state-of-the-art models, including PE-AV, while using significantly less training data. The results are compelling, showcasing clear scaling trends that validate the proposed framework. However, the paper lacks a thorough comparison with a broader range of models and datasets, which could provide a more comprehensive understanding of AuroLA's performance across different scenarios.
The paper mentions that code is available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient implementation details or hyperparameter settings that would allow other researchers to easily replicate the experiments. A more detailed supplementary material or appendix could enhance reproducibility.
One limitation is the reliance on the quality and diversity of the audio data curated for training, which may affect the generalizability of the model. Additionally, while the use of MLLMs is innovative, the computational cost associated with training and deploying such models could be a barrier to practical applications. The paper also does not address potential biases in the data or the model's performance across different languages or dialects.
The implications of this research are significant, particularly in applications such as multimedia search engines, accessibility tools for the hearing impaired, and content-based audio retrieval systems. By improving audio-text retrieval capabilities, this work could enhance user experiences in various domains, including education, entertainment, and information retrieval.
Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Adobe Research, OpenAI
This paper introduces AudioChat, a pioneering framework for multi-source audio storytelling, editing, and understanding, which utilizes innovative methodologies to advance the field of audio processing in machine learning. The comprehensive evaluation of its technical contributions, methodology, and implications for future research underscores its significance in the domain.
The paper presents a novel framework, AudioChat, which integrates audio generation, editing, and understanding through a unified model. The methodology leverages a tool-calling agent, AudioCopilot, to synthesize training data through simulated user interactions, which is innovative in addressing the data scarcity issue in complex audio scene processing. The introduction of the Audio Transfusion Forcing objective is a significant advancement, allowing the model to perform structured reasoning and multi-turn interactions effectively. The architecture employs a continuous audio tokenizer and a multi-modal language model, which are well-justified and contribute to the model's performance.
The experiments are comprehensive, evaluating AudioChat against various baselines across multiple tasks including storytelling, editing, and understanding. The use of novel evaluation metrics like multiFLAM and editFLAM provides a more nuanced assessment of the model's capabilities compared to traditional metrics. The results indicate that AudioChat outperforms existing models, demonstrating its effectiveness in handling complex audio tasks. However, the paper could benefit from more detailed comparisons with a broader range of existing methods.
The authors provide ample details regarding the training data, hyperparameters, and methodology, which supports reproducibility. However, the proprietary nature of some training data may limit full replication of the results. The paper does a commendable job of outlining the architecture and training process, allowing for potential implementation by other researchers.
One limitation is the reliance on synthetic data generated by AudioCopilot, which may not capture the full diversity of real-world audio scenarios. Additionally, while the model shows promise, its performance in edge cases or highly nuanced audio tasks remains to be thoroughly evaluated. The potential ethical implications of audio generation technologies, such as misuse for impersonation, are acknowledged but not deeply explored.
The development of AudioChat has significant implications for various applications in multimedia, including film, gaming, and virtual reality, where immersive audio storytelling is crucial. The ability to generate and edit complex audio scenes could enhance user experiences in these domains. However, the potential for misuse in creating deceptive audio content raises ethical concerns that need to be addressed by the research community.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimal look-ahead size for each input token, the model can incorporate future context for every token, leading to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
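The chunk-wise processing with a guaranteed look-ahead can be illustrated with an attention mask: each token may attend to all past positions plus at least a fixed number of future positions, rounded up to the next chunk boundary so inference can still advance one chunk at a time. This is a generic streaming-Conformer construction, not the paper's exact implementation.

```python
import numpy as np


def streaming_mask(seq_len, chunk_size, min_lookahead):
    """Boolean mask: mask[i, j] == True means token i may attend to j.
    Every token sees the full past plus at least `min_lookahead` future
    tokens; the visible region extends to the end of the chunk containing
    the guaranteed look-ahead frame. Illustrative sketch only."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # end of the chunk that contains position i + min_lookahead
        last = min(seq_len, ((i + min_lookahead) // chunk_size + 1) * chunk_size)
        mask[i, :last] = True
    return mask
```

Tokens near the start of a chunk see more than the minimum look-ahead; the guarantee is a lower bound, which is what stabilizes prediction for tokens near chunk boundaries.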
Primary: LY Corporation
All Institutions: LY Corporation
The paper presents CC-G2PnP, a novel streaming model for grapheme-to-phoneme and prosody conversion that addresses the challenges of unsegmented languages. Its innovative methodology and robust experimental results position it as a significant contribution to the field of audio processing and speech synthesis.
The proposed CC-G2PnP model employs a Conformer-CTC architecture that innovatively processes grapheme tokens in chunks, allowing for streaming inference of phonemic and prosodic labels. The introduction of minimum look-ahead (MLA) is a significant methodological advancement, as it addresses the limitations of previous streaming models that rely on explicit word boundaries. This approach is particularly beneficial for unsegmented languages like Japanese, where word boundaries are not clearly defined. The integration of self-conditioned CTC into the architecture further enhances the model's performance by allowing dynamic learning of alignments between graphemes and phonemes.
The experiments conducted on a Japanese dataset demonstrate the effectiveness of CC-G2PnP, showing significant improvements in character error rate (CER) and sentence error rate (SER) compared to baseline models. The use of both objective metrics and subjective assessments of TTS naturalness provides a comprehensive evaluation of the model's performance. The dataset preparation and experimental conditions are well-documented, allowing for a clear understanding of the model's capabilities and limitations.
While the paper provides detailed descriptions of the model architecture and training procedures, the lack of a publicly available code repository or demo URL limits reproducibility. The absence of specific hyperparameters and training configurations in a readily accessible format could hinder other researchers from replicating the results.
One limitation noted is the reliance on a large amount of training data to achieve optimal performance, which may not be feasible for all applications. Additionally, while the model performs well in terms of accuracy, the subjective evaluation of TTS naturalness could vary based on the speaker used during testing, which may not generalize across different voices.
The CC-G2PnP model has the potential to significantly enhance text-to-speech systems, particularly for languages without explicit word boundaries. This could lead to more natural and efficient human-machine interactions in various applications, including virtual assistants, language learning tools, and accessibility technologies for the visually impaired. The advancements in streaming G2PnP could also inspire further research in related areas, such as real-time speech synthesis and multilingual processing.
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
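The headline scaling result can be made concrete with a small calculation. If compute splits as $C \propto N \cdot D$ (the usual $C \approx 6ND$ approximation) and the IsoFLOP fits give power laws $N_{opt} \propto C^a$ and $D_{opt} \propto C^b$, then $a + b = 1$, and "optimal data grows 1.6× faster than optimal model size" corresponds to $b/a \approx 1.6$. The exponents below are derived from that ratio, not quoted from the paper.

```python
# Solve a + b = 1 with b = 1.6 * a  =>  a = 1/2.6, b = 1.6/2.6
ratio = 1.6
a = 1 / (1 + ratio)        # model-size exponent: N_opt ∝ C^a ≈ C^0.385
b = ratio / (1 + ratio)    # data exponent:       D_opt ∝ C^b ≈ C^0.615

# Scaling compute by 100x multiplies the compute-optimal parameter count
# by 100**a and the compute-optimal token count by 100**b.
growth_params = 100 ** a
growth_tokens = 100 ** b
```

Under these assumed exponents, a 100× compute budget increase favors roughly a 17× larger dataset but only about a 6× larger model, i.e., discrete audio models in this regime are more data-hungry than parameter-hungry.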
Primary: Stanford University
All Institutions: Stanford University, SCB 10X, OpenAthena, University of Southern California, University of Cambridge
The main contribution of this paper is the introduction of SODA, a scalable audio foundation model that effectively integrates semantic, acoustic, and text tokens, providing a comprehensive framework for advancing audio modeling. This work significantly enhances the understanding of scaling laws in audio models and sets a foundation for future innovations in the field.
The methodology presented in the paper is robust and systematic, focusing on the design choices that influence the performance of audio foundation models. The authors thoroughly investigate various aspects, including data sources, text mixture ratios, and token composition, which are critical for optimizing model performance. The introduction of the SODA model, which integrates semantic, acoustic, and text tokens, represents a significant advancement in audio modeling. The use of next-token prediction at scale is a novel approach that extends the capabilities of existing models.
The paper includes a comprehensive empirical evaluation, particularly through the IsoFLOP analysis that examines scaling laws for discrete audio models. The authors provide extensive experimentation across 64 models, which is a commendable effort to validate their findings. The results indicate that optimal data grows faster than model size, which is a valuable insight for future research in this area. However, the paper could benefit from more detailed comparisons with existing models beyond the scaling predictions.
While the authors mention establishing a validated training recipe, the paper lacks specific implementation details that would facilitate reproducibility. Providing access to code or detailed hyperparameter settings would enhance the paper's contribution to the community and allow for independent verification of results.
One limitation is the reliance on a specific architecture for the SODA model, which may not generalize well to all audio tasks. Additionally, the paper does not address potential biases in the training data or the implications of using large-scale models in real-world applications. The scaling law findings, while insightful, may also be context-dependent and require further validation across diverse datasets.
The implications of this research are significant, as it opens up new avenues for audio generation and cross-modal tasks, such as speech-to-speech translation. The ability to model semantic content alongside acoustic details can enhance applications in various domains, including entertainment, accessibility, and communication technologies. The findings could influence future research directions and encourage the development of more sophisticated audio models.
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
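The abstract describes CGP as prototype-based with a gating mechanism over all frozen layers. One plausible minimal reading of that design, with hypothetical names throughout, is a softmax-gated convex combination of per-layer features scored against class prototypes:

```python
import torch
import torch.nn as nn


class ConvexGatedProbe(nn.Module):
    """Probe over L frozen layers: softmax gates form a convex combination
    of per-layer features, and class prototypes score the pooled feature.
    Illustrative sketch of the idea, not the authors' implementation."""

    def __init__(self, num_layers, dim, num_classes):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(num_layers))   # one gate per layer
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def forward(self, layer_feats):                # (L, B, D) frozen features
        w = torch.softmax(self.gates, dim=0)       # convex: w >= 0, sum(w) = 1
        pooled = torch.einsum("l,lbd->bd", w, layer_feats)
        # cosine similarity to class prototypes as logits
        pooled = nn.functional.normalize(pooled, dim=-1)
        protos = nn.functional.normalize(self.prototypes, dim=-1)
        return pooled @ protos.T                   # (B, num_classes)
```

A side effect of the convex gating is interpretability: inspecting the trained gate weights reveals which frozen layers carry the task-relevant information, matching the abstract's claim that CGP "exposes the location of latent task-relevant information."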
Primary: Ghent University
All Institutions: Ghent University, Fraunhofer IEE, University of Kassel
The paper presents a significant advancement in audio self-supervised learning through the introduction of Convex Gated Probing and the Better Audio Transformer, addressing critical gaps in evaluation methodologies and model performance. The comprehensive experimental validation and emphasis on reproducibility enhance its contributions to the field.
The paper introduces Convex Gated Probing (CGP), a novel probing method that leverages a gating mechanism to efficiently utilize all frozen layers of audio SSL models. This approach addresses the limitations of existing probing techniques, which often fail to capture the full potential of audio embeddings. The methodology is well-structured, presenting a clear rationale for the design choices and improvements made to the SSL pipeline, leading to the development of the Better Audio Transformer (BAT). The integration of CGP into the SSL framework is innovative and shows promise in enhancing model evaluation and performance.
The experiments are comprehensive, demonstrating the effectiveness of BAT across various audio benchmarks. The authors provide detailed comparisons against state-of-the-art models, showcasing significant performance improvements in both frozen-feature probing and fine-tuning scenarios. The results are well-documented, with sufficient statistical rigor to support the claims made regarding the superiority of BAT over existing models.
The authors emphasize the importance of reproducibility and provide a new PyTorch implementation to facilitate this. However, the paper mentions challenges in replicating results from existing models, which raises questions about the reliability of previous benchmarks. The authors' efforts to standardize methodologies and hyperparameters contribute positively to the reproducibility aspect, although the lack of a public code repository limits accessibility.
One limitation noted is the reliance on the specific architecture of the Better Audio Transformer, which may not generalize across different audio tasks or datasets. Additionally, while the CGP method shows promise, its effectiveness in more complex audio scenarios or with other model architectures remains to be validated. The paper also acknowledges the challenges of hyperparameter sensitivity in fine-tuning, which could affect the generalizability of results.
The advancements presented in this work have the potential to significantly impact the field of self-supervised audio representation learning. By improving the evaluation methods and model architectures, the research could lead to more efficient and accessible audio models, reducing computational overhead and fostering innovation in audio-related applications. The focus on reproducibility and transparency also aligns with broader efforts to enhance the reliability of machine learning research.
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally, we validate our results through a subjective listener test, where subjects gave positive ratings to the proposed model generations. This technique paves the way for more controllable and expressive generative sound frameworks, enabling sound designers to focus less on tedious, repetitive tasks and more on their actual creative process.
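The masking-plus-guidance recipe can be sketched generically: the reference occupies part of the latent timeline, a mask pins its (noised) latents at each denoising step, and classifier-free guidance is applied on the masked input. Names, the toy noising schedule, and the guidance combination below are illustrative; the paper's guidance variant may differ.

```python
import torch


def add_noise(x, t):
    """Toy forward-noising for the sketch (t in [0, 1])."""
    return (1 - t) * x + t * torch.randn_like(x)


def masked_cfg_step(model, z_t, t, ref_latent, mask, w=3.0):
    """One guided denoising step with the reference region in-painted.
    mask == 1 marks reference positions on the latent time axis;
    `model(z, t, cond)` is a hypothetical DiT interface."""
    # Pin the reference region to the appropriately noised reference so the
    # generated region extends (or morphs between) the prompt(s).
    z_in = mask * add_noise(ref_latent, t) + (1 - mask) * z_t
    eps_cond = model(z_in, t, True)      # conditional prediction
    eps_uncond = model(z_in, t, False)   # unconditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)  # standard CFG combine
```

Morphing between two references fits the same template: each reference is pinned at one end of the timeline and the unmasked middle is generated to bridge them.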
Primary: unknown
All Institutions: unknown
The paper presents a novel approach for generating high-quality audio extensions and morphs using Diffusion Transformers and a variant of classifier-free guidance. The technical contributions are significant, addressing real-world challenges faced by sound designers and demonstrating promising results through rigorous evaluation.
The methodology presented in this paper is robust and innovative, leveraging Diffusion Transformers and a novel Audio Prompt Guidance technique to effectively extend and morph audio. The authors provide a clear description of their approach, including the masking function and the fine-tuning strategy using the Noise Floor Dataset to mitigate hallucinations. However, while the methodology is well-structured, it could benefit from a more detailed exploration of the limitations of the masking function and guidance techniques in varying audio contexts.
The experimental evaluation is comprehensive, employing both objective metrics (Fréchet Audio Distance) and subjective listener tests to validate the effectiveness of the proposed model. The use of a large dataset for training and the careful selection of evaluation clips from sound design professionals enhances the credibility of the results. However, the paper could improve by including more diverse audio samples and comparing against a broader range of existing methods.
The paper provides sufficient detail on the architecture, training process, and evaluation metrics, which aids in reproducibility. However, the absence of specific code or model weights limits the ease with which other researchers can replicate the results. Including a GitHub repository or similar resource would significantly enhance reproducibility.
The paper acknowledges the potential for hallucinations in generated audio, particularly with stationary sounds, and discusses the trade-off between reducing hallucinations and maintaining fidelity to the original prompts. However, it does not thoroughly address how the model performs with non-stationary sounds or in complex soundscapes, which could be a significant limitation for practical applications.
The proposed model has the potential to significantly impact the field of sound design by automating tedious tasks and enhancing the creative process for sound designers. The ability to generate high-quality audio extensions and morphs could streamline workflows in various industries, including film, gaming, and virtual reality. Furthermore, the methodology could inspire future research in generative audio models and their applications in other domains.
This paper presents virtual upmixing of steering vectors captured by a spherical microphone array with relatively few channels. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data and then rendering higher-order ambisonics (HOA) data with a physics-based acoustic simulator. This approach, however, struggles with the mutual dependency between the spatial directivity of source estimation and the limited spatial resolution of FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) learns a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings conditioned on the FOA data. Experimental results show that SIRUP achieves significant improvements over FOA systems in steering vector upmixing, source localization, and speech denoising.
Primary: unknown
All Institutions: JSPS KAKENHI, JST FOREST, ANR Project SAROUMANE
The main contribution of this paper is the introduction of SIRUP, a novel diffusion-based approach for enhancing spatial audio representation from FOA to HOA, which addresses critical limitations in existing methods and demonstrates significant improvements in sound source localization and speech denoising. The methodology is innovative, and the experimental results are promising, indicating a strong potential impact on the field of audio processing and machine listening.
The proposed SIRUP method innovatively integrates a variational autoencoder (VAE) with a latent diffusion model to enhance steering vector upmixing from first-order ambisonics (FOA) to higher-order ambisonics (HOA). This approach addresses the limitations of traditional methods by directly learning a latent representation of HOA data, conditioned on FOA inputs, which is a significant departure from the conventional cascaded analysis-rendering pipeline. The use of a composite loss function that combines cosine similarity with MSE is a thoughtful addition that likely contributes to the stability and performance of the model.
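The composite loss mentioned above can be sketched as a weighted sum of a cosine-similarity term (penalizing directional mismatch between latent vectors) and an MSE term (penalizing magnitude error). This is a minimal illustration; the weighting `alpha`, normalization, and exact combination are assumptions, not the paper's specification.

```python
import numpy as np

def composite_loss(pred, target, alpha=0.5):
    """Weighted sum of (1 - cosine similarity) and mean squared error.

    pred, target: 1-D latent vectors (e.g. flattened HOA embeddings).
    `alpha` balances the two terms; its value here is illustrative.
    """
    mse = np.mean((pred - target) ** 2)
    # Small epsilon guards against division by zero for null vectors.
    cos = np.dot(pred, target) / (
        np.linalg.norm(pred) * np.linalg.norm(target) + 1e-8
    )
    return alpha * (1.0 - cos) + (1.0 - alpha) * mse
```

The cosine term keeps the predicted embedding pointing in the right direction in latent space even when its scale is off, which plausibly accounts for the stability benefit the review attributes to this loss.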
The experimental setup is robust, utilizing simulated room impulse responses to evaluate the performance of SIRUP across various conditions, including different signal-to-noise ratios and reverberation times. The metrics chosen for evaluation, such as beamwidth and directivity index, are appropriate for assessing the quality of the upmixed steering vectors. The results indicate that SIRUP significantly outperforms FOA systems, demonstrating its effectiveness in sound source localization and speech denoising.
While the paper provides a detailed description of the methodology, including model architecture and training procedures, it lacks explicit links to code repositories or supplementary materials that would facilitate reproducibility. The absence of a publicly available implementation may hinder other researchers from validating the findings.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world acoustic environments. Additionally, the paper does not address the scalability of the method to larger microphone arrays or the potential computational costs associated with training the diffusion model.
The implications of this research are significant for machine listening applications, particularly in augmented reality, robotics, and autonomous systems, where accurate spatial audio representation is crucial. By improving the spatial resolution of sound source localization and enhancing speech denoising, SIRUP could lead to advancements in user experience and system performance in these domains.