Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
Primary: UC Berkeley
All Institutions: UC Berkeley
StyleStream introduces the first real-time zero-shot voice style conversion system capable of modifying timbre, accent, and emotion with an end-to-end latency of approximately 1 second. This paper represents a significant technical contribution to the field, addressing key challenges in voice style conversion through innovative methodology and rigorous experimental validation, thus paving the way for practical applications in various domains.
The methodology presented in StyleStream is innovative, combining a Destylizer for content-style disentanglement with a Stylizer based on a diffusion transformer. The use of ASR loss and a compact finite scalar quantization (FSQ) bottleneck is a significant advancement over previous methods, allowing for cleaner disentanglement of linguistic content from style attributes. The non-autoregressive architecture enables real-time processing, which is a notable improvement in the field of voice style conversion. The paper provides a comprehensive description of the architecture, training procedures, and the rationale behind design choices, demonstrating a solid understanding of the challenges in voice style conversion.
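The compact FSQ bottleneck referenced here can be illustrated with a minimal sketch: each latent dimension is bounded and then snapped to a small fixed set of levels, which caps how much style information the content stream can carry. The function below is a generic finite scalar quantization illustration, not the paper's exact configuration (number of levels and bounding function are assumptions).

```python
import numpy as np

def fsq_quantize(z, levels=5):
    """Finite scalar quantization (generic sketch): bound each latent
    dimension to (-1, 1), then round it to one of `levels` uniformly
    spaced values. The level count here is an assumption, not the
    paper's setting."""
    z = np.tanh(z)                      # bound to (-1, 1)
    half = (levels - 1) / 2.0
    return np.round(z * half) / half    # snap to {-1, -0.5, 0, 0.5, 1}

z = np.array([0.1, -2.3, 0.9])
q = fsq_quantize(z, levels=5)           # -> [0.0, -1.0, 0.5]
```

Because the codebook is an implicit grid rather than learned vectors, FSQ avoids codebook-collapse issues while still enforcing a hard information limit per frame.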
The experimental evaluation is thorough, utilizing a diverse dataset of 50k hours of English speech for training and a well-structured test set for evaluation. The results show that StyleStream outperforms existing methods in terms of intelligibility and style fidelity across multiple metrics, including WER and similarity scores. The paper effectively communicates the performance improvements over baseline models, providing both objective and subjective evaluations, which strengthen the claims of superior performance.
The paper includes detailed descriptions of the architecture, training configurations, and evaluation metrics, which facilitate reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers. The authors could enhance reproducibility by providing access to their trained models and detailed implementation instructions.
One limitation is the reliance on a large amount of training data, which may not be readily available for all researchers. Additionally, while the paper claims real-time processing capabilities, the actual latency may vary depending on hardware, which could limit practical applications in certain environments. The model's performance with shorter reference utterances is also noted to degrade, indicating a potential limitation in flexibility.
The advancements presented in StyleStream have significant implications for applications in voice synthesis, dubbing, and personalized voice assistants, where real-time style conversion can enhance user experience. The ability to modify timbre, accent, and emotion in real-time opens up new avenues for interactive applications in entertainment, education, and accessibility, potentially impacting how voice technologies are integrated into daily life.
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Prevailing approaches treat it as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a new, dynamic framework for envelope reconstruction that leverages this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance these cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a powerful synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple mapping, but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Laboratory for Computational Audio Perception
The main contribution of this work is the introduction of the DECAF model, which reframes speech envelope reconstruction from EEG as a dynamic state-estimation problem, significantly improving reconstruction accuracy through the integration of temporal context. This innovative approach not only advances the state-of-the-art in auditory attention decoding but also opens new avenues for research in brain-computer interfaces and neurotechnology.
The proposed DECAF model introduces a novel dynamic framework for reconstructing speech envelopes from EEG data by integrating a predictive temporal prior with direct neural estimates. This approach is innovative as it shifts the paradigm from static regression to dynamic state estimation, leveraging temporal dependencies inherent in speech signals. The architecture is modular, consisting of three core components: the EEG to Envelope decoder, the Envelope Forecaster, and the Dynamic Fusion gate, which work together to enhance reconstruction accuracy. The use of a learned gating mechanism to balance the contributions of neural evidence and temporal context is particularly noteworthy, as it allows for adaptive integration of information.
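The gated fusion described here can be sketched in a few lines: a gate value near 1 trusts the EEG-decoded estimate, while a value near 0 falls back on the forecast from recent speech context. This is a deliberately simplified scalar form; DECAF's actual gate is a trained network conditioned on both streams.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_envelope(eeg_estimate, context_prediction, gate_logit):
    """Blend a direct EEG-decoded envelope sample with a forecast from
    recent speech context via a scalar gate (hypothetical simplification
    of the paper's learned Dynamic Fusion gate)."""
    g = sigmoid(gate_logit)             # g -> 1 trusts the EEG stream
    return g * eeg_estimate + (1.0 - g) * context_prediction

# gate_logit = 0 gives g = 0.5, i.e. equal weighting of both cues
y = fused_envelope(0.8, 0.4, 0.0)       # -> 0.6
```

Under high EEG noise the learned gate can drive `g` toward 0, letting the temporal prior carry the estimate, which is the adaptive behavior the review highlights.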
The authors validate their model using the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static EEG-only baselines and achieving state-of-the-art performance. The experiments are well-structured, comparing DECAF against established baselines, including traditional methods and contemporary deep learning architectures. The results are quantitatively supported by statistical significance tests, and the ablation studies provide insights into the contributions of each component of the model.
The paper provides sufficient details regarding the dataset, experimental setup, and model training procedures, which enhances reproducibility. The authors adhere to established protocols and publicly share their code repository, facilitating further experimentation and validation by other researchers.
While the model shows promise, it may face challenges in real-world applications where EEG signals are subject to high noise levels. The performance of DECAF under extreme noise conditions aligns with baseline models, indicating a potential limitation in robustness. Additionally, the reliance on past predictions may introduce biases if the initial estimates are inaccurate.
The implications of this research extend to neuro-steered hearing aids and auditory attention decoding systems, potentially improving the quality of life for individuals with hearing impairments. By enhancing the accuracy of speech envelope reconstruction, the DECAF framework could lead to more effective auditory processing technologies, making it a significant contribution to both machine learning and assistive technologies.
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through melody-conditioned text-to-music models, the task of cover song generation remains largely unaddressed. In this work, we reformulate cover song generation as a conditional generation task that simultaneously generates new vocals and accompaniment conditioned on the original vocal melody and text prompts. To this end, we present SongEcho, which leverages Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), a framework that enables controllable generation by improving both the conditioning injection mechanism and the conditional representation. To enhance the conditioning injection mechanism, we extend Feature-wise Linear Modulation (FiLM) to Element-wise Linear Modulation (EiLM) to facilitate precise temporal alignment in melody control. For conditional representations, we propose Instance-Adaptive Condition Refinement (IACR), which refines conditioning features by interacting with the hidden states of the generative model, yielding instance-adaptive conditioning. Additionally, to address the scarcity of large-scale, open-source full-song datasets, we construct Suno70k, a high-quality AI song dataset enriched with comprehensive annotations. Experimental results across multiple datasets demonstrate that our approach generates superior cover songs compared to existing methods while requiring fewer than 30% of the trainable parameters. The code, dataset, and demos are available at https://github.com/lsfhuihuiff/SongEcho_ICLR2026.
Primary: National Natural Science Foundation of China
All Institutions: National Natural Science Foundation of China, China Scholarship Council, German Research Foundation, National Science and Technology Council, Taiwan
The paper presents a novel approach to cover song generation through the introduction of IA-EiLM and IACR, significantly advancing the field of audio machine learning. The methodology and experimental results indicate a strong potential for practical applications in music generation, although further work is needed to enhance reproducibility and address limitations.
The methodology presented in the paper is innovative, particularly with the introduction of Instance-Adaptive Element-wise Linear Modulation (IA-EiLM) and Instance-Adaptive Condition Refinement (IACR). The extension of Feature-wise Linear Modulation (FiLM) to EiLM is a notable advancement that addresses the challenge of temporal alignment in melody control. The dual focus on generating both vocals and accompaniment conditioned on the original melody and text prompts is a significant step forward in cover song generation. However, the paper could benefit from a more detailed explanation of the underlying mechanics of IA-EiLM and IACR, particularly how they interact with the generative model's hidden states.
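The FiLM-to-EiLM distinction can be made concrete with a small sketch: FiLM applies one scale/shift pair per channel, broadcast over all time steps, whereas element-wise modulation supplies a distinct pair for every time step and channel, which is what permits frame-level melody alignment. The tensor shapes below are illustrative assumptions, not the paper's exact dimensions.

```python
import numpy as np

def film(h, gamma, beta):
    """Feature-wise Linear Modulation: one (gamma, beta) per channel,
    shared across all T time steps. h has shape (T, C)."""
    return gamma[None, :] * h + beta[None, :]

def eilm(h, gamma, beta):
    """Element-wise Linear Modulation (sketch): a distinct (gamma, beta)
    for every (time step, channel) element, enabling frame-level control.
    gamma and beta have the same (T, C) shape as h."""
    return gamma * h + beta

T, C = 4, 3
h = np.ones((T, C))
out_film = film(h, gamma=np.full(C, 2.0), beta=np.zeros(C))          # uniform over time
out_eilm = eilm(h, gamma=np.arange(T * C, dtype=float).reshape(T, C),
                beta=np.zeros((T, C)))                               # varies per frame
```

The trade-off is parameter count: the modulation-generating network must now emit T times as many coefficients, which is presumably where IACR's instance-adaptive refinement earns its keep.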
The experimental results are compelling, demonstrating that SongEcho outperforms existing methods while utilizing fewer parameters, which suggests a more efficient model. The construction of the Suno70k dataset is a valuable contribution, addressing a critical gap in the availability of high-quality, annotated datasets for song generation tasks. However, the paper should provide more comprehensive comparisons with a wider range of baseline methods to strengthen the claims of superiority.
The authors have made the code and dataset publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation instructions and hyperparameter settings, which could hinder other researchers from replicating the results accurately.
One limitation of the study is the potential overfitting due to the small size of the dataset relative to the complexity of the task. Additionally, the subjective nature of music generation means that quantitative metrics may not fully capture the quality of the generated songs. The paper could also explore the limitations of the IA-EiLM and IACR methods in different musical contexts or genres.
The proposed framework has significant implications for the music industry, particularly in automated music composition and cover song generation. By enabling more nuanced and emotionally resonant reinterpretations of existing songs, this research could enhance creative processes in music production. Furthermore, the availability of the Suno70k dataset could spur further research in music generation and related fields.
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
Primary: Dartmouth College
All Institutions: Dartmouth College, Hume AI
The paper presents TADA, a generative framework that synchronizes text and acoustic features for improved speech modeling. This innovative approach addresses key challenges in traditional TTS systems, offering a significant contribution to the field with its potential for high-fidelity, efficient speech synthesis.
The proposed methodology introduces a novel tokenization scheme that achieves one-to-one synchronization between text and acoustic features, which is a significant advancement over traditional fixed-frame-rate approaches. The use of a dual alignment mechanism and a flow matching head within a large language model framework allows for efficient and high-fidelity speech synthesis. The architecture is well-structured, leveraging a combination of variational autoencoders and transformer-based models, which enhances the model's ability to handle both modalities concurrently. The approach to mitigate the modality gap through Speech Free Guidance (SFG) is particularly innovative, allowing for flexible integration of text and speech modalities.
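The text-only guidance (SFG) idea amounts to blending next-token logits from two decoding passes over the same context. A minimal sketch with a linear interpolation is shown below; the mixing weight `alpha` and the exact blending rule are assumptions here, since the paper's formulation may weight the modes differently.

```python
import numpy as np

def guided_logits(logits_text_speech, logits_text_only, alpha=0.3):
    """Hypothetical logit-blending rule for text-only guidance: pull the
    text-speech distribution toward the text-only mode, whose logits
    reflect the base LLM's linguistic knowledge.

    alpha = 0 recovers pure text-speech decoding; alpha = 1 would decode
    as a text-only LLM."""
    return (1.0 - alpha) * logits_text_speech + alpha * logits_text_only

l = guided_logits(np.array([2.0, 0.0]), np.array([0.0, 2.0]), alpha=0.5)
# equal blend -> both tokens receive the same logit
```

This mirrors classifier-free-guidance-style interpolation, except the two "branches" are modality modes of one model rather than conditioned/unconditioned passes.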
The experiments conducted demonstrate the effectiveness of the proposed model against state-of-the-art TTS and SLM systems. The authors provide a comprehensive evaluation using multiple datasets and metrics, including character error rate, speaker similarity, and subjective evaluations of naturalness. The results indicate that the proposed method not only matches but often exceeds the performance of existing models, particularly in terms of reducing content hallucinations and improving inference efficiency. The extensive dataset used for training and evaluation further supports the robustness of the findings.
The paper includes a link to the GitHub repository containing the code and pre-trained models, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific configurations could be elaborated further to enhance clarity for future researchers attempting to replicate the results.
One limitation noted is the potential for speaker drifting during long-form generation, which suggests that while the model performs well in many scenarios, there are still challenges in maintaining speaker consistency over extended outputs. Additionally, the subjective evaluations indicate that while the model performs competitively, there is room for improvement in perceptual audio quality.
The implications of this research are significant for the field of speech synthesis and spoken language modeling. The ability to generate high-fidelity speech with reduced hallucinations and improved efficiency can enhance applications in voice cloning, virtual assistants, and interactive AI systems. The methodology could also pave the way for further advancements in multimodal AI systems, where seamless integration of text and speech is crucial.
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, which details our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning on perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the single most effective approach. Conversely, for speaker diarization, we observe that open-source state-of-the-art models (such as Diarizen) perform surprisingly poorly on this complex dataset. Extensive model retraining yields negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proves to be the primary driver of accuracy gains. Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
Primary: KUET
All Institutions: KUET, BUET
The main contribution of this paper is the introduction of the Lipi-Ghor-882 dataset and the development of optimized ASR and speaker diarization pipelines for Bengali audio. This work addresses critical gaps in the field, providing a foundation for future research and applications in low-resource speech processing. The comprehensive analysis of methodologies and results highlights the significance of targeted fine-tuning and heuristic post-processing in overcoming the challenges of long-form audio processing.
The methodology presented in the paper is robust, focusing on two critical areas: Automatic Speech Recognition (ASR) and speaker diarization for Bengali audio. The authors systematically evaluated various architectures and fine-tuning strategies, demonstrating a clear understanding of the challenges associated with long-form audio processing. The innovative use of perfectly aligned annotations and synthetic acoustic degradation for ASR is particularly noteworthy. The shift from model retraining to heuristic post-processing for diarization reflects a pragmatic approach to overcoming the limitations of existing models in this domain.
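The heuristic post-processing praised here typically takes forms like merging same-speaker segments separated by short gaps. The function below is a hypothetical illustration of that kind of rule, not the paper's actual post-processing (the gap threshold and segment representation are assumptions).

```python
def merge_short_gaps(segments, max_gap=0.5):
    """Hypothetical diarization post-processing: merge consecutive
    segments from the same speaker when the silence between them is at
    most `max_gap` seconds. Segments are (start, end, speaker) tuples
    in seconds."""
    merged = []
    for seg in sorted(segments):
        if merged and merged[-1][2] == seg[2] and seg[0] - merged[-1][1] <= max_gap:
            # extend the previous segment instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], seg[1]), seg[2])
        else:
            merged.append(seg)
    return merged

segs = [(0.0, 1.0, "A"), (1.2, 2.0, "A"), (2.1, 3.0, "B")]
out = merge_short_gaps(segs)   # A's two segments merge; B stays separate
```

Rules of this shape directly reduce the missed-speech and false-split components of diarization error without touching the underlying model.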
The experiments conducted are thorough, with a clear focus on optimizing inference time and accuracy. The introduction of the Lipi-Ghor-882 dataset is a significant contribution, as it addresses the scarcity of resources for Bengali ASR and diarization. The results indicate a well-structured evaluation process, with the authors providing insights into the performance of various models and the effectiveness of their proposed methods. The achievement of a Real-Time Factor (RTF) of 0.019 is impressive and establishes a benchmark for future work.
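For context on the RTF figure: Real-Time Factor is simply processing time divided by audio duration, so an RTF of 0.019 means an hour of audio is processed in roughly 68 seconds.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed.
    RTF < 1 means faster than real time."""
    return processing_seconds / audio_seconds

# processing one hour (3600 s) of audio in 68.4 s gives the
# reported ~0.019 RTF
rtf = real_time_factor(68.4, 3600.0)   # -> 0.019
```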
While the paper outlines the methodologies and experiments in detail, it lacks specific implementation details and code availability, which are crucial for reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the findings or build upon this work.
The paper acknowledges several limitations, including compute resource constraints, data quality issues, and the challenges of model optimization for Bengali features. The reliance on heuristic post-processing for diarization, while effective, indicates that further research is needed to develop more robust models for this task.
This research has significant implications for the field of speech processing, particularly for low-resource languages like Bengali. The introduction of a large-scale dataset and the development of optimized pipelines can facilitate advancements in conversational AI, making it more accessible for Bengali speakers. The findings may also inspire similar approaches in other low-resource language contexts.
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
Primary: Tampere University
All Institutions: Tampere University, Nokia Technologies
The paper presents a novel dual-branch architecture for moving speaker separation that effectively addresses challenges in multi-channel speech processing. The methodology is well-founded and the experimental results demonstrate substantial improvements over existing methods, highlighting its potential impact in practical applications.
The proposed dual-branch parallel spectral-spatial (PS2) architecture represents a significant methodological advancement in the field of multi-channel speech separation. By separating the processing of spectral and spatial features, the authors effectively address the inherent modeling conflicts present in existing sequential architectures. The use of bi-directional long short-term memory (BLSTM) and bi-directional gated recurrent unit (BGRU) networks, along with a cross-attention fusion mechanism, allows for more nuanced feature extraction and integration. This approach is well-grounded in the theoretical understanding of the different temporal scales at which spectral and spatial features evolve, making it a thoughtful and innovative contribution to the field.
The experimental setup is robust, utilizing multiple datasets including WHAMR! and a newly generated WSJ0-Demand-6ch-Move dataset specifically designed for moving speaker scenarios. The results demonstrate clear improvements over state-of-the-art methods, with significant gains in scale-invariant signal-to-distortion ratio (SI-SDR) across varying acoustic conditions. The ablation studies provide valuable insights into the contributions of each component of the PS2 architecture, reinforcing the importance of the dual-branch design. However, the paper could benefit from a more detailed analysis of the computational efficiency and potential trade-offs involved in the proposed architecture.
The paper provides a comprehensive description of the architecture, training configurations, and datasets used, which supports reproducibility. However, the absence of publicly available code or a project URL limits the ability for others to replicate the results directly. Future work could include releasing the model and training scripts to enhance reproducibility within the research community.
One limitation of the study is the reliance on synthetic datasets for moving speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, while the model shows strong performance in various conditions, the authors do not extensively discuss its limitations in extreme acoustic scenarios or potential failure modes. The evaluation metrics primarily focus on SI-SDR, which, while important, may not encompass all aspects of speech quality and intelligibility.
The advancements in multi-channel speech separation have significant implications for various applications, including voice recognition systems, hearing aids, and telecommunication technologies. The ability to effectively separate moving speakers in dynamic environments could enhance user experiences in real-world applications, making this research particularly relevant in the context of increasing reliance on audio processing technologies in daily life.
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
Primary: Jilin University
All Institutions: Hunan University, Jilin University, Shandong University, University of Electronic Science and Technology of China
UniWhisper presents a novel approach to continual multi-task training for universal audio representation, achieving competitive performance across diverse audio tasks while maintaining strong speech capabilities. This work significantly contributes to the field by addressing the limitations of existing models and offering a streamlined methodology that enhances both efficiency and effectiveness in audio representation learning.
The methodology presented in UniWhisper is innovative, focusing on a unified instruction and answer format for continual multi-task training. This approach effectively addresses the limitations of existing audio encoders that excel in specific domains but struggle with others. By leveraging a single encoder and a compact pretrained language model as the decoder, the authors streamline the training process and reduce redundancy in audio token representation. The decision to utilize shallow MLP probes and kNN for evaluation is appropriate, as it allows for a clear assessment of the model's performance across diverse tasks.
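The kNN probing protocol mentioned here has a simple generic form: freeze the encoder, embed every clip, and classify each test embedding by majority vote over its k nearest training embeddings. The sketch below is that generic recipe, not UniWhisper's exact evaluation setup (distance metric, k, and pooling are assumptions).

```python
import numpy as np

def knn_probe_accuracy(train_emb, train_labels, test_emb, test_labels, k=3):
    """Frozen-encoder kNN probe (generic sketch): majority vote over the
    k nearest training embeddings by Euclidean distance."""
    correct = 0
    for x, y in zip(test_emb, test_labels):
        dists = np.linalg.norm(train_emb - x, axis=1)
        nearest = train_labels[np.argsort(dists)[:k]]
        vote = np.bincount(nearest).argmax()
        correct += int(vote == y)
    return correct / len(test_labels)

# toy two-cluster example standing in for pooled audio embeddings
train = np.array([[0.0], [0.1], [1.0], [1.1]])
labels = np.array([0, 0, 1, 1])
acc = knn_probe_accuracy(train, labels,
                         np.array([[0.05], [1.05]]), np.array([0, 1]), k=3)
```

Because kNN has no trainable head at all, it measures how linearly-separable-without-training the representation is, complementing the MLP probe numbers.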
The experimental setup is robust, utilizing a substantial training dataset of 38k hours of public audio, which enhances the generalizability of the results. The evaluation across 20 tasks provides a comprehensive view of the model's capabilities. The results indicate that UniWhisper outperforms existing models like Whisper, HuBERT, and others, particularly in non-speech tasks, which is a significant achievement. The use of normalized weighted averages for performance metrics is a strong point, ensuring comparability across different tasks.
The paper mentions that code and pretrained weights will be released upon acceptance, which is crucial for reproducibility. However, the details provided in the methodology, such as specific hyperparameters and training configurations, are adequately described, allowing other researchers to replicate the experiments. The use of a compact pretrained language model as a decoder is also a noteworthy detail that aids in understanding the architecture.
One limitation is the reliance on a single encoder, which may not capture all nuances across diverse audio tasks as effectively as dual-encoder systems. Additionally, while the results are promising, the paper does not extensively discuss the potential trade-offs in performance when adapting UniWhisper to other audio domains not covered in the evaluation.
The implications of this work are significant, as it proposes a more efficient method for training universal audio representations that can be applied in various applications, including speech recognition, environmental sound classification, and music analysis. The potential for reduced training costs and improved performance across multiple tasks could lead to advancements in audio processing technologies, making them more accessible and effective in real-world applications.
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for many languages. Taiwanese Hokkien exemplifies this issue: although a wealth of spoken content is available in television dramas and online videos, transcriptions are scarce, and most available subtitles are provided only in Mandarin. To address this deficiency, we introduce TG-ASR, a translation-guided ASR framework for Taiwanese Hokkien drama speech recognition that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered on the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the TG-ASR framework, which effectively utilizes translation-guided learning through a novel PGCA mechanism to enhance automatic speech recognition for low-resource languages. This research addresses critical gaps in ASR technology and provides a valuable resource for future studies in multilingual and low-resource language processing.
The methodology presented in TG-ASR is innovative, particularly the introduction of the Parallel Gated Cross-Attention (PGCA) mechanism, which adaptively integrates multilingual translation embeddings into the ASR decoder. This approach is well-justified, addressing the specific challenges of low-resource languages by leveraging auxiliary languages to improve transcription accuracy. The two-stage training process is clearly articulated, ensuring that the model benefits from both initial fine-tuning and subsequent integration of multilingual embeddings. However, the reliance on pre-trained models for auxiliary language embeddings and the potential noise introduced by machine translations are notable considerations.
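The exact PGCA equations are not reproduced in this summary, so the following is only a hypothetical sketch of the general shape such a mechanism takes: decoder states attend over one auxiliary language's translation embeddings, a sigmoid gate scales how much of each branch is let through, and the parallel branches are combined residually:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(h, aux_mem, Wq, Wk, Wv, Wg):
    """One gated cross-attention branch: decoder states `h` attend over
    one auxiliary language's translation embeddings `aux_mem`; a sigmoid
    gate computed from h scales the attended context."""
    q, k, v = h @ Wq, aux_mem @ Wk, aux_mem @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (dec_len, mem_len)
    ctx = attn @ v
    gate = 1.0 / (1.0 + np.exp(-(h @ Wg)))          # element-wise gate
    return gate * ctx

rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(3, d))            # 3 decoder positions
branches = []
for _ in range(2):                     # two auxiliary languages, in parallel
    mem = rng.normal(size=(5, d))      # 5 translation-embedding slots
    Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
    branches.append(gated_cross_attention(h, mem, *Ws))
fused = h + sum(branches)              # residual sum of gated branches
```

The gate is what lets the decoder down-weight an auxiliary language whose (machine-translated, hence noisy) embeddings are unhelpful for a given utterance, which matches the paper's claim of minimized cross-language interference.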
The experiments are comprehensive, utilizing the newly created YT-THDC corpus, which is a significant contribution to the field of low-resource ASR. The results demonstrate a substantial reduction in character error rate (CER), validating the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of various components of the PGCA mechanism, reinforcing the robustness of the findings. However, the paper could benefit from additional comparative analyses with state-of-the-art models to contextualize the performance gains more clearly.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly accessible code repository or demo limits the ability for other researchers to replicate the results fully. The authors should consider releasing their code and model weights to facilitate further research and validation.
The study acknowledges several limitations, including the size and domain specificity of the YT-THDC corpus, which may restrict generalizability. Additionally, the reliance on auxiliary translations introduces potential noise that could affect performance. The findings are also specific to Taiwanese Hokkien, and the effectiveness of the approach for other low-resource languages remains to be validated.
The work has significant implications for the preservation of endangered languages and the accessibility of media content. By improving ASR for Taiwanese Hokkien, the research contributes to cultural preservation efforts and enhances bilingual accessibility. The methodology could be adapted for other low-resource languages, potentially benefiting a wider range of linguistic communities.
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
Primary: UC Berkeley
All Institutions: UC Berkeley
StyleStream introduces the first real-time zero-shot voice style conversion system capable of modifying timbre, accent, and emotion with an end-to-end latency of approximately 1 second. This paper represents a significant technical contribution to the field, addressing key challenges in voice style conversion through innovative methodology and rigorous experimental validation, thus paving the way for practical applications in various domains.
The methodology presented in StyleStream is innovative, combining a Destylizer for content-style disentanglement with a Stylizer based on a diffusion transformer. The use of ASR loss and a compact finite scalar quantization (FSQ) bottleneck is a significant advancement over previous methods, allowing for cleaner disentanglement of linguistic content from style attributes. The non-autoregressive architecture enables real-time processing, which is a notable improvement in the field of voice style conversion. The paper provides a comprehensive description of the architecture, training procedures, and the rationale behind design choices, demonstrating a solid understanding of the challenges in voice style conversion.
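Finite scalar quantization itself is a known technique; a minimal sketch of the bottleneck idea follows. The level counts and latent dimensionality below are assumptions (StyleStream's actual configuration is not given here), and the straight-through gradient trick used during training is omitted:

```python
import numpy as np

def fsq(z, levels):
    """Finite scalar quantization: squash each latent dimension into a
    bounded range with tanh, then round it to one of `levels[i]`
    evenly spaced values. The codebook is implicit: the grid of all
    level combinations (here 8 * 5 * 5 = 200 codes)."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half       # each dim now in [-half_i, half_i]
    return np.round(bounded) / half   # snap to grid, rescale to [-1, 1]

z = np.array([[0.3, -2.0, 5.0],
              [0.0, 1.0, -0.2]])      # toy Destylizer outputs
zq = fsq(z, levels=[8, 5, 5])
```

The tiny implicit codebook is the "highly constrained information bottleneck": there is simply not enough capacity in the quantized latent to carry timbre, accent, or emotion, so those attributes must be re-supplied by the Stylizer from the reference speech.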
The experimental evaluation is thorough, utilizing a diverse dataset of 50k hours of English speech for training and a well-structured test set for evaluation. The results show that StyleStream outperforms existing methods in terms of intelligibility and style fidelity across multiple metrics, including WER and similarity scores. The paper effectively communicates the performance improvements over baseline models, providing both objective and subjective evaluations, which strengthen the claims of superior performance.
The paper includes detailed descriptions of the architecture, training configurations, and evaluation metrics, which facilitate reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers. The authors could enhance reproducibility by providing access to their trained models and detailed implementation instructions.
One limitation is the reliance on a large amount of training data, which may not be readily available for all researchers. Additionally, while the paper claims real-time processing capabilities, the actual latency may vary depending on hardware, which could limit practical applications in certain environments. The model's performance with shorter reference utterances is also noted to degrade, indicating a potential limitation in flexibility.
The advancements presented in StyleStream have significant implications for applications in voice synthesis, dubbing, and personalized voice assistants, where real-time style conversion can enhance user experience. The ability to modify timbre, accent, and emotion in real time opens up new avenues for interactive applications in entertainment, education, and accessibility, potentially impacting how voice technologies are integrated into daily life.
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Prevailing approaches treat it as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a new, dynamic framework for envelope reconstruction that leverages this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance these cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a powerful synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple mapping, but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Laboratory for Computational Audio Perception
The main contribution of this work is the introduction of the DECAF model, which reframes speech envelope reconstruction from EEG as a dynamic state-estimation problem, significantly improving reconstruction accuracy through the integration of temporal context. This innovative approach not only advances the state-of-the-art in auditory attention decoding but also opens new avenues for research in brain-computer interfaces and neurotechnology.
The proposed DECAF model introduces a novel dynamic framework for reconstructing speech envelopes from EEG data by integrating a predictive temporal prior with direct neural estimates. This approach is innovative as it shifts the paradigm from static regression to dynamic state estimation, leveraging temporal dependencies inherent in speech signals. The architecture is modular, consisting of three core components: the EEG to Envelope decoder, the Envelope Forecaster, and the Dynamic Fusion gate, which work together to enhance reconstruction accuracy. The use of a learned gating mechanism to balance the contributions of neural evidence and temporal context is particularly noteworthy, as it allows for adaptive integration of information.
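The DECAF equations are not reproduced in this summary, so the toy sketch below only illustrates the fusion idea: at each step, a gate in (0, 1) blends the EEG-decoded envelope sample with a forecast from recent fused context. A hypothetical AR(2) predictor stands in for the learned Envelope Forecaster, and the gate weights are made up:

```python
import numpy as np

def fuse(eeg_est, forecast, w, b):
    """Convex combination of the EEG-decoded sample and the temporal
    forecast, weighted by a gate g in (0, 1) computed from both cues."""
    g = 1.0 / (1.0 + np.exp(-(w[0] * eeg_est + w[1] * forecast + b)))
    return g * eeg_est + (1.0 - g) * forecast

def run(eeg_estimates, context, ar_coeffs, w, b):
    """Causal loop: forecast the next envelope sample from the last few
    fused samples, then fuse that forecast with the noisy EEG estimate."""
    out = list(context)
    p = len(ar_coeffs)
    for e in eeg_estimates:
        forecast = float(np.dot(ar_coeffs, out[-p:]))
        out.append(fuse(e, forecast, w, b))
    return np.array(out[len(context):])

env = run(eeg_estimates=[0.5, 0.6, 0.4],   # noisy per-window EEG decodes
          context=[0.4, 0.5],              # recent envelope history
          ar_coeffs=[0.3, 0.7],            # toy AR(2) temporal prior
          w=[1.0, 1.0], b=0.0)
```

Because the output is a convex combination, each fused sample stays between the two cues; when the EEG estimate is unreliable the gate can lean on the temporal prior, which is the "dynamic state-estimation" framing in a nutshell.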
The authors validate their model using the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static EEG-only baselines and achieving state-of-the-art performance. The experiments are well-structured, comparing DECAF against established baselines, including traditional methods and contemporary deep learning architectures. The results are quantitatively supported by statistical significance tests, and the ablation studies provide insights into the contributions of each component of the model.
The paper provides sufficient details regarding the dataset, experimental setup, and model training procedures, which enhances reproducibility. The authors adhere to established protocols and publicly share their code repository, facilitating further experimentation and validation by other researchers.
While the model shows promise, it may face challenges in real-world applications where EEG signals are subject to high noise levels. The performance of DECAF under extreme noise conditions aligns with baseline models, indicating a potential limitation in robustness. Additionally, the reliance on past predictions may introduce biases if the initial estimates are inaccurate.
The implications of this research extend to neuro-steered hearing aids and auditory attention decoding systems, potentially improving the quality of life for individuals with hearing impairments. By enhancing the accuracy of speech envelope reconstruction, the DECAF framework could lead to more effective auditory processing technologies, making it a significant contribution to both machine learning and assistive technologies.
Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text-speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM-based forced-alignment toolkits (e.g., MFA), which are pipeline-heavy and less flexible than neural aligners; fixed-ratio interleaving of text and speech tokens struggles to capture text-speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC-based aligner and introduces a bi-word-based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks. Speech samples are available at https://ctctts.github.io/.
Primary: Tsinghua University
All Institutions: Tsinghua University, Xinjiang University
The paper presents CTC-TTS, a novel dual-streaming text-to-speech synthesis method that leverages CTC alignment and bi-word interleaving strategies. The approach demonstrates significant improvements in synthesis quality and latency, marking a meaningful contribution to the field of audio processing and machine learning.
The paper introduces a novel approach to text-to-speech synthesis by employing a Connectionist Temporal Classification (CTC)-based alignment mechanism, which is a significant departure from traditional GMM-HMM forced alignment methods. The introduction of a bi-word interleaving strategy enhances the model's ability to capture temporal dependencies between text and speech, addressing the limitations of fixed-ratio interleaving. The two variants, CTC-TTS-L and CTC-TTS-F, are well-defined and cater to different quality-latency trade-offs, showcasing a thoughtful design that balances synthesis quality with operational efficiency. The methodology is sound, with clear explanations of the alignment process and interleaving strategies.
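To make the two variants concrete, here is an illustrative toy of the difference between them. In the real system the chunk boundaries come from CTC alignments and bi-word segmentation; the token strings and sizes below are invented:

```python
import numpy as np

def interleave_length(text_tok, speech_tok, pairs):
    """CTC-TTS-L style: concatenate alternating chunks of text and
    speech tokens along the sequence axis. `pairs` lists
    (n_text, n_speech) chunk sizes per interleaving step."""
    seq, ti, si = [], 0, 0
    for n_t, n_s in pairs:
        seq.extend(text_tok[ti:ti + n_t]); ti += n_t
        seq.extend(speech_tok[si:si + n_s]); si += n_s
    return seq

def interleave_feature(text_emb, speech_emb):
    """CTC-TTS-F style: stack time-aligned text and speech embeddings
    along the feature axis, so every position carries both modalities
    and the sequence does not grow."""
    return np.concatenate([text_emb, speech_emb], axis=-1)

text = ["he", "llo", "wor", "ld"]        # toy text units
speech = [f"s{i}" for i in range(6)]     # toy speech tokens
seq = interleave_length(text, speech, pairs=[(2, 3), (2, 3)])
stacked = interleave_feature(np.zeros((6, 8)), np.ones((6, 16)))
```

The sketch makes the trade-off visible: length-wise concatenation yields a longer sequence (more attention compute, reported higher quality), while feature-wise stacking keeps the sequence short (lower latency) at the cost of forcing a hard frame-level alignment.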
The experimental setup is robust, utilizing well-known datasets such as LibriSpeech and VoiceAssistant400K for evaluation. The authors provide comprehensive results comparing their method against established baselines, demonstrating significant improvements in both streaming synthesis and zero-shot tasks. The use of objective metrics like Word Error Rate (WER) and Character Error Rate (CER) alongside subjective evaluations (Mean Opinion Score) adds credibility to the findings. The results indicate that both CTC-TTS variants outperform existing methods, validating the proposed approach.
The paper includes detailed implementation details, including model architecture, training configurations, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly available code repository at this stage may hinder full reproducibility until the code is released post-acceptance.
While the proposed method shows promising results, the paper does not address potential limitations regarding the scalability of the approach to multi-speaker scenarios or its performance in diverse linguistic contexts. Additionally, the reliance on a specific G2P model (Phonetisaurus) may limit flexibility in phoneme generation.
The advancements in low-latency, high-quality TTS systems have significant implications for applications in virtual assistants, audiobooks, and real-time communication tools. The ability to synthesize speech with reduced latency while maintaining naturalness can enhance user experience in various interactive applications. The research could pave the way for further innovations in speech synthesis technologies.
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through melody-conditioned text-to-music models, the task of cover song generation remains largely unaddressed. In this work, we reformulate cover song generation as conditional generation, simultaneously generating new vocals and accompaniment conditioned on the original vocal melody and text prompts. To this end, we present SongEcho, which leverages Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), a framework that enables controllable generation by improving both the conditioning injection mechanism and the conditional representation. To enhance the conditioning injection mechanism, we extend Feature-wise Linear Modulation (FiLM) to Element-wise Linear Modulation (EiLM), facilitating precise temporal alignment in melody control. For conditional representations, we propose Instance-Adaptive Condition Refinement (IACR), which refines conditioning features by interacting with the hidden states of the generative model, yielding instance-adaptive conditioning. Additionally, to address the scarcity of large-scale, open-source full-song datasets, we construct Suno70k, a high-quality AI song dataset enriched with comprehensive annotations. Experimental results across multiple datasets demonstrate that our approach generates superior cover songs compared to existing methods, while requiring fewer than 30% of the trainable parameters. The code, dataset, and demos are available at https://github.com/lsfhuihuiff/SongEcho_ICLR2026.
Primary: National Natural Science Foundation of China
All Institutions: National Natural Science Foundation of China, China Scholarship Council, German Research Foundation, National Science and Technology Council, Taiwan
The paper presents a novel approach to cover song generation through the introduction of IA-EiLM and IACR, significantly advancing the field of audio machine learning. The methodology and experimental results indicate a strong potential for practical applications in music generation, although further work is needed to enhance reproducibility and address limitations.
The methodology presented in the paper is innovative, particularly with the introduction of Instance-Adaptive Element-wise Linear Modulation (IA-EiLM) and Instance-Adaptive Condition Refinement (IACR). The extension of Feature-wise Linear Modulation (FiLM) to EiLM is a notable advancement that addresses the challenge of temporal alignment in melody control. The dual focus on generating both vocals and accompaniment conditioned on the original melody and text prompts is a significant step forward in cover song generation. However, the paper could benefit from a more detailed explanation of the underlying mechanics of IA-EiLM and IACR, particularly how they interact with the generative model's hidden states.
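FiLM is a standard technique, so the FiLM-to-EiLM step can be shown with a toy sketch. The shapes are assumptions, and in SongEcho the gammas and betas would be predicted from the melody condition rather than fixed as they are here:

```python
import numpy as np

def film(h, gamma, beta):
    """FiLM: one (gamma, beta) pair per channel, shared across all
    time steps, so it cannot express frame-level melody control."""
    return gamma[None, :] * h + beta[None, :]      # h is (T, C)

def eilm(h, gamma, beta):
    """Element-wise Linear Modulation: a separate (gamma, beta) for
    every time step AND channel, giving the per-frame control that
    precise temporal alignment of a melody condition requires."""
    return gamma * h + beta                         # all shapes (T, C)

T, C = 5, 3
h = np.ones((T, C))
out_film = film(h, gamma=np.full(C, 2.0), beta=np.zeros(C))
out_eilm = eilm(h,
                gamma=np.arange(T * C, dtype=float).reshape(T, C),
                beta=np.zeros((T, C)))
```

The only change is the broadcasting pattern, which is why EiLM adds essentially no parameters to the backbone; the cost moves into the conditioning network that must now emit (T, C)-shaped modulation maps.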
The experimental results are compelling, demonstrating that SongEcho outperforms existing methods while utilizing fewer parameters, which suggests a more efficient model. The construction of the Suno70k dataset is a valuable contribution, addressing a critical gap in the availability of high-quality, annotated datasets for song generation tasks. However, the paper should provide more comprehensive comparisons with a wider range of baseline methods to strengthen the claims of superiority.
The authors have made the code and dataset publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation instructions and hyperparameter settings, which could hinder other researchers from replicating the results accurately.
One limitation of the study is the potential overfitting due to the small size of the dataset relative to the complexity of the task. Additionally, the subjective nature of music generation means that quantitative metrics may not fully capture the quality of the generated songs. The paper could also explore the limitations of the IA-EiLM and IACR methods in different musical contexts or genres.
The proposed framework has significant implications for the music industry, particularly in automated music composition and cover song generation. By enabling more nuanced and emotionally resonant reinterpretations of existing songs, this research could enhance creative processes in music production. Furthermore, the availability of the Suno70k dataset could spur further research in music generation and related fields.
Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional classification models. To address this, we propose a Longitudinal Intra-Patient Tracking (LIPT) scheme designed to capture the trajectory of relative symptomatic changes within individuals. Central to this framework is a Personalised Sequential Encoder (PSE), which transforms longitudinal speech recordings into context-aware latent representations. By incorporating historical data at each timestamp, the PSE facilitates a holistic assessment of the clinical trajectory rather than modelling discrete visits independently. Experimental results from a cohort of 225 patients demonstrate that the LIPT paradigm significantly outperforms the classic cross-sectional approaches, achieving a recognition accuracy of 99.7% for clinical status transitions. The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings. Furthermore, this work addresses the gap in existing literature by providing a comprehensive analysis of different speech task designs and acoustic features. Taken together, the superior performance of the LIPT framework and PSE architecture validates their readiness for integration into long-term telemonitoring systems, offering a scalable solution for remote heart failure management.
Primary: Taizhou People’s Hospital
All Institutions: Taizhou People’s Hospital, Jiangsu, China
The main contribution of this paper is the development of a personalized speech-based monitoring system for heart failure that utilizes a longitudinal approach to track individual patient trajectories. This innovative methodology and its promising results highlight the potential for speech dynamics to serve as effective biomarkers in the management of chronic health conditions.
The proposed methodology introduces a novel Longitudinal Intra-Patient Tracking (LIPT) framework that leverages a Personalised Sequential Encoder (PSE) to model heart failure (HF) progression through speech dynamics. This approach is innovative as it shifts the focus from traditional cross-sectional models to a longitudinal perspective, capturing individual patient trajectories over time. The methodology is well-structured, with clear stages for feature extraction, statistical screening, and longitudinal tracking, which collectively enhance the model's ability to account for inter-individual variability in speech characteristics. The integration of both global and frame-level features, particularly the emphasis on RASTA features, demonstrates a thoughtful approach to feature selection that aligns with the clinical context of HF monitoring.
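The PSE architecture itself is not specified in this summary, so the following is only a toy recurrent encoder illustrating the longitudinal idea: each visit's acoustic feature vector is folded into a running patient state, so the representation at visit t reflects the whole history up to t rather than that visit alone:

```python
import numpy as np

def sequential_encode(visits, Wx, Wh, b):
    """Toy recurrent encoder over a patient's longitudinal recordings.
    `visits` is (n_visits, n_features); the returned row for visit t
    is a context-aware state conditioned on visits 0..t."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in visits:
        h = np.tanh(Wx @ x + Wh @ h + b)   # fold this visit into the state
        states.append(h.copy())
    return np.stack(states)                # (n_visits, hidden_dim)

rng = np.random.default_rng(0)
visits = rng.normal(size=(4, 6))           # 4 visits, 6 acoustic features
Wx = rng.normal(size=(3, 6)) * 0.3
Wh = rng.normal(size=(3, 3)) * 0.3
states = sequential_encode(visits, Wx, Wh, np.zeros(3))
```

A downstream classifier over these states then judges *relative* change within the patient, which is the mechanism by which the framework sidesteps inter-individual variability in baseline voice characteristics.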
The experimental evaluation is robust, utilizing a substantial cohort of 225 patients and multiple speech tasks to assess the model's performance. The results indicate a significant improvement in classification accuracy (99.7%) compared to traditional methods, underscoring the effectiveness of the LIPT framework. The use of follow-up data to validate model performance further strengthens the findings, although the paper could benefit from additional comparative analyses with more diverse datasets to enhance generalizability.
The paper provides a comprehensive description of the data collection process, feature extraction methods, and model architecture, which are essential for reproducibility. However, the lack of detailed hyperparameter settings and training procedures may pose challenges for other researchers attempting to replicate the results. The availability of code and models on GitHub is a positive aspect that facilitates reproducibility.
One notable limitation is the potential for high false-positive rates in identifying stable patients, which could affect clinical applicability. The model's reliance on specific speech tasks may also limit its generalizability across different populations or settings. Additionally, the study's focus on a single institution may restrict the diversity of the patient cohort, impacting the external validity of the findings.
This research has significant implications for remote patient monitoring in heart failure management, particularly in resource-limited settings. The ability to accurately track HF status through non-invasive speech analysis could enhance patient safety and reduce healthcare costs. The findings may pave the way for integrating such technologies into routine clinical practice, thereby improving patient outcomes and access to care.
This paper presents a novel framework for continuous telemonitoring of heart failure using personalized speech dynamics, significantly advancing the field of remote patient management. The integration of longitudinal analysis with advanced machine learning techniques demonstrates a promising direction for future healthcare applications.
The proposed methodology introduces the Longitudinal Intra-Patient Tracking (LIPT) paradigm, which is innovative in its focus on personalized monitoring of heart failure through speech dynamics. The Personalised Sequential Encoder (PSE) is a significant advancement, allowing for the capture of temporal dependencies in speech data, which is critical for understanding individual patient trajectories. The integration of both global and frame-level features enhances the robustness of the model. The methodology is well-structured, addressing the limitations of traditional cross-sectional approaches by emphasizing longitudinal data analysis.
The experimental setup is comprehensive, involving a cohort of 225 patients and multiple speech tasks that yield a substantial dataset for analysis. The results indicate a remarkable accuracy of 99.7% in recognizing clinical status transitions, which is a strong validation of the proposed framework. The comparative analysis against baseline models (XGBoost and Fully Connected Neural Network) demonstrates the effectiveness of the LIPT approach. However, while the results are impressive, the paper could benefit from additional details on the statistical significance of the findings and potential confounding factors.
The paper provides a clear outline of the methods and algorithms used, along with a GitHub repository for code and trained models, which supports reproducibility. However, the absence of detailed hyperparameter settings and specific training protocols in the main text may hinder full reproducibility for other researchers.
One limitation is the potential for high false-positive rates in identifying stable patients, which could impact clinical applicability. The model's reliance on specific speech tasks may also limit its generalizability across different populations or languages. Furthermore, the study's focus on a single cohort from a specific region may not account for broader demographic variations.
The implications of this research are significant, particularly in enhancing remote monitoring of heart failure patients, especially in resource-limited settings. By leveraging non-invasive speech analysis, the proposed system could improve patient outcomes through timely interventions. The approach also opens avenues for further research in speech-based diagnostics across various medical conditions.
Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events. However, practical composition and performance workflows frequently rely on resource-limited devices (e.g., electronic instruments and portable computers), making heavy memory and attention computation difficult to deploy. We introduce Depth-Structured Music Recurrence (DSMR), a recurrent long-context Transformer for full-piece symbolic music modeling that extends context beyond fixed-length excerpts via segment-level recurrence with detached cross-segment states, featuring a layer-wise memory-horizon schedule that budgets recurrent KV states across depth. DSMR is trained in a single left-to-right pass over each complete composition, akin to how a musician experiences it from beginning to end, while carrying recurrent cross-segment states forward. Within this recurrent framework, we systematically study how depth-wise horizon allocations affect optimization, best-checkpoint perplexity, and efficiency. By allocating different history-window lengths across layers while keeping the total recurrent-state budget fixed, DSMR creates depth-dependent temporal receptive fields within a recurrent attention stack without reducing compute depth. Our main instantiation is a two-scale DSMR schedule that allocates long history windows to lower layers and a uniform short window to the remaining layers. Experiments on the piano performance dataset MAESTRO demonstrate that two-scale DSMR provides a practical quality-efficiency recipe for full-length long-context symbolic music modeling with recurrent attention under limited computational resources.
Primary: Auckland University of Technology
All Institutions: Auckland University of Technology
The paper presents a novel approach to long-context modeling in symbolic music generation through the Depth-Structured Music Recurrence framework, effectively balancing computational efficiency with the need for extensive contextual information. The comprehensive methodology and rigorous experimental evaluation contribute to its significance in advancing the field of machine learning for music generation.
The proposed Depth-Structured Music Recurrence (DSMR) framework innovatively addresses the challenges of long-context modeling in symbolic music generation by implementing a recurrent long-context Transformer that utilizes segment-level recurrence with detached cross-segment states. The methodology is well-structured, allowing for depth-dependent temporal receptive fields by varying memory horizons across layers. This approach is particularly relevant for resource-constrained environments, as it balances computational efficiency with the need for extensive contextual information in music generation.
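The recurrence mechanism described above (detached cross-segment states with a layer-wise memory-horizon schedule) can be sketched as follows; the function, horizon values, and tensor shapes are illustrative assumptions rather than the paper's code:

```python
import torch

def carry_memory(memories, new_states, horizons):
    """Sketch of DSMR-style cross-segment state carrying: per layer l, keep
    only the last horizons[l] hidden states, detached so no gradient flows
    across segment boundaries. Lower layers can be given longer horizons
    (the two-scale schedule) while the total state budget stays fixed."""
    updated = []
    for mem, h, horizon in zip(memories, new_states, horizons):
        cat = torch.cat([mem, h], dim=1) if mem is not None else h
        updated.append(cat[:, -horizon:].detach())  # truncate + stop-gradient
    return updated

# toy example: 2 layers, two-scale horizons (long below, short above),
# processing three segments of length 4 in a single left-to-right pass
horizons = [8, 2]
mems = [None, None]
for _ in range(3):
    seg_states = [torch.randn(1, 4, 16) for _ in horizons]
    mems = carry_memory(mems, seg_states, horizons)
print([m.shape[1] for m in mems])  # [8, 2]
```

Each layer's retained states would then be prepended to that layer's keys/values when attending within the next segment, giving the depth-dependent temporal receptive fields the abstract describes.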
The experiments conducted on the MAESTRO dataset are rigorous and well-documented, demonstrating the effectiveness of the two-scale DSMR model in achieving lower perplexity compared to other methods under similar memory constraints. The evaluation metrics, including perplexity and efficiency (tokens processed per second, peak memory usage), provide a comprehensive view of the model's performance. The comparative analysis with other models, including full-attention references, adds credibility to the findings.
The paper provides sufficient details regarding the experimental setup, model architecture, and training protocols, which enhances reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which other researchers can replicate the results.
The study acknowledges certain limitations, such as the potential impact of the chosen model scale and memory settings on performance. Additionally, the findings may not generalize to all long-context domains, and the exploration of the design space for memory retention and gating could be expanded.
The implications of this research are significant for the field of symbolic music generation, particularly in enabling more efficient models that can operate on consumer-grade hardware. This advancement could facilitate real-time applications in music composition and performance, making sophisticated music generation tools more accessible to creators.
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher by 1.55% on average across all metrics, while the resulting 2E1D student improves on the supervised baseline by 3.79% on average and achieves almost the same performance as the teacher. Both students show large gains on rare chord qualities.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a two-stage training pipeline that leverages pseudo-labeling and knowledge distillation to enhance automatic chord recognition, particularly in scenarios with limited labeled data. This work presents a significant advancement in the field, offering a practical solution to a common challenge in music information retrieval.
The paper proposes a two-stage training pipeline that effectively utilizes pseudo-labeling and knowledge distillation to enhance automatic chord recognition. The methodology is well-structured, with a clear distinction between the two training phases: the first leverages a pre-trained teacher model to generate pseudo-labels from unlabeled audio, while the second phase incorporates ground-truth labels with selective knowledge distillation to mitigate catastrophic forgetting. This approach is innovative in its decoupling of labeled and unlabeled data training, allowing for improved model performance even when labeled data is scarce.
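The stage-2 objective described above (supervised loss plus teacher distillation as a forgetting regularizer) admits a standard formulation; the weighting, temperature, and function name below are assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def stage2_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Sketch of stage-2 training: cross-entropy on ground-truth chord labels
    plus a KD term pulling the student toward the frozen teacher, acting as a
    regularizer against forgetting the stage-1 (pseudo-label) representations.
    alpha and T are illustrative hyperparameters."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature scaling of the distillation gradient
    return (1 - alpha) * ce + alpha * kd
```

The "selective" aspect in the paper would amount to applying the KD term only on chosen frames or classes; the sketch applies it uniformly for brevity.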
The experiments are comprehensive, utilizing over 1,000 hours of unlabeled audio across various datasets. The results demonstrate significant improvements in performance metrics, particularly for rare chord qualities, which is a critical aspect of chord recognition. The use of standard mir_eval metrics adds rigor to the evaluation, and the comparative analysis against traditional supervised learning baselines highlights the effectiveness of the proposed method. However, the paper could benefit from more detailed ablation studies to further validate the impact of each component in the training pipeline.
The paper provides sufficient detail regarding the training configurations, datasets, and evaluation metrics, which supports reproducibility. However, the absence of a clear description of the experimental setup and hyperparameter tuning could pose challenges for other researchers attempting to replicate the results. The provided GitHub link to the project may aid in this regard, assuming it contains the necessary code and documentation.
One limitation identified is the reliance on the quality of the teacher model for generating pseudo-labels. If the teacher model is biased or poorly generalizable, it could negatively impact the performance of the student model. Additionally, the paper does not address potential issues related to the scalability of the method when applied to larger datasets or more complex chord recognition tasks.
The proposed methodology has significant implications for the field of music information retrieval, particularly in enhancing automatic chord recognition capabilities. By effectively utilizing unlabeled data, this approach could lower the barriers to developing robust ACR systems, making them more accessible for various applications in music analysis, education, and automated music transcription. The focus on improving recognition of rare chord qualities also addresses a critical gap in existing ACR systems.
Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generating source L2 speech and using authentic native speech as the training target, our approach avoids learning from TTS artifacts and, crucially, requires no real L2 data in training. Alongside this data strategy, we introduce CosyAccent, a non-autoregressive model that resolves the trade-off between prosodic naturalness and duration control. CosyAccent implicitly models rhythm for flexibility yet offers explicit control over total output duration. Experiments show that, despite being trained without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.
Primary: Shenzhen Research Institute of Big Data
All Institutions: Shenzhen Research Institute of Big Data
The paper presents CosyAccent, a novel duration-controllable accent normalization model that utilizes a unique source-synthesis training data strategy to improve the naturalness and content preservation of accent conversion systems. This work represents a meaningful advancement in the field, addressing critical challenges in accent normalization and offering a scalable solution for future research and applications.
The paper introduces a novel "source-synthesis" methodology for constructing training data, which synthesizes L2 source speech from a high-quality L1 corpus, thereby avoiding TTS artifacts. This innovative approach allows the model to be trained without real L2 data, which is a significant advancement in the field of accent normalization. The CosyAccent model itself is a non-autoregressive architecture that effectively balances prosodic naturalness and explicit duration control, addressing a critical limitation in existing models.
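The source-synthesis pairing can be sketched at the data level; `synthesize_l2` stands in for whatever accented TTS system generates the source side and is purely hypothetical:

```python
def build_training_pairs(native_corpus, synthesize_l2):
    """Sketch of the source-synthesis data strategy: for each authentic native
    (L1) utterance, synthesize an accented L2 rendering of the same text and
    use it as the model INPUT, with the real native audio as the TARGET.
    No real L2 speech is required anywhere in training."""
    pairs = []
    for item in native_corpus:  # item: {"text": ..., "audio": ...}
        l2_source = synthesize_l2(item["text"])   # accented synthetic source
        pairs.append((l2_source, item["audio"]))  # (input, target)
    return pairs
```

Because the target side is always authentic native speech, the model never learns to reproduce TTS artifacts, which is the point of the strategy.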
The experiments are well-structured, comparing CosyAccent against strong baselines trained on real L2 data. The results demonstrate significant improvements in content preservation and naturalness, validated through both subjective and objective metrics. The use of a diverse dataset covering multiple accents adds robustness to the evaluation, although the paper could benefit from a more extensive discussion on the statistical significance of the results.
The paper provides adequate details regarding the model architecture, training data construction, and evaluation metrics, which supports reproducibility. The inclusion of a GitHub repository for code and data further enhances the potential for other researchers to replicate the findings.
The primary limitation noted is the model's robustness to acoustic noise and control over paralinguistic features, as the synthetic data used for training is very clean. Additionally, the reliance on a specific TTS model for data generation may limit the generalizability of the approach to other languages or accents.
The implications of this research are significant, particularly for applications in language learning, dubbing, and personalized TTS systems. By reducing the dependency on real L2 data, the proposed method could facilitate the development of accent normalization systems in resource-scarce languages, potentially broadening access to language education and media.
In sequence-to-sequence Transformer ASR, autoregressive (AR) models achieve strong accuracy but suffer from slow decoding, while non-autoregressive (NAR) models enable parallel decoding at the cost of degraded performance. We propose a principled NAR ASR framework based on Masked Diffusion Models to reduce this gap. A pre-trained speech encoder is coupled with a Transformer diffusion decoder conditioned on acoustic features and partially masked transcripts for parallel token prediction. To mitigate the training-inference mismatch, we introduce Iterative Self-Correction Training that exposes the model to its own intermediate predictions. We also design a Position-Biased Entropy-Bounded Confidence-based sampler to further boost results. Experiments across multiple benchmarks demonstrate consistent gains over prior NAR models and competitive performance with strong AR baselines, while retaining parallel decoding efficiency.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology, Università degli Studi di Palermo
The paper presents MDM-ASR, a novel approach that leverages masked diffusion models for efficient and accurate non-autoregressive automatic speech recognition. The integration of innovative methodologies and comprehensive experimental validation positions this work as a meaningful contribution to the field of machine learning and speech processing.
The proposed MDM-ASR framework innovatively integrates masked diffusion models into the ASR domain, addressing the limitations of autoregressive and traditional non-autoregressive models. The use of Iterative Self-Correction Training (ISCT) to align training with inference is a significant methodological advancement, as it allows the model to learn from its own predictions, thereby enhancing robustness. The introduction of Position-Biased Entropy-Bounded Confidence-based samplers further refines the decoding process, showcasing a well-thought-out approach to improving efficiency and accuracy.
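A minimal sketch of confidence-based iterative unmasking with a positional bias follows; `predict_fn`, the step schedule, and the bias form are assumptions, not the paper's exact sampler:

```python
import torch

def position_biased_unmasking(predict_fn, length, steps=4, mask_id=0, bias=0.1):
    """Sketch of confidence-based iterative decoding with a positional bias:
    start fully masked and, at each step, commit the highest-confidence masked
    positions, giving earlier positions a small confidence bonus (transcripts
    are roughly left-to-right). predict_fn(tokens) -> (ids, confidences) is a
    stand-in for the diffusion decoder, not the paper's interface."""
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = (tokens == mask_id).sum().item()
        if masked == 0:
            break
        ids, conf = predict_fn(tokens)                      # per-position proposal
        conf = conf - bias * torch.arange(length) / length  # favor early positions
        conf[tokens != mask_id] = -float("inf")             # already committed
        k = max(1, masked // (steps - step))                # commit schedule
        for i in conf.topk(k).indices:
            tokens[i] = ids[i]
    return tokens
```

In the full system the committed/uncommitted pattern would be re-fed to the decoder at each step, which is exactly the regime Iterative Self-Correction Training exposes the model to.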
The experiments are comprehensive, covering multiple English and multilingual datasets, and the results demonstrate that MDM-ASR outperforms existing models in both accuracy and decoding efficiency. The ablation studies provide valuable insights into the contributions of various components, reinforcing the robustness of the findings. However, the reliance on specific datasets may limit the generalizability of the results.
The paper provides sufficient details regarding the experimental setup, including model architecture and training procedures, which enhances reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
The paper acknowledges limitations in terms of dataset diversity and the need for further exploration of alternative model configurations. Additionally, the evaluation is primarily based on benchmark datasets, which may not fully capture real-world performance across varied conditions.
The advancements in ASR technology presented in this paper have significant implications for real-time applications, such as virtual assistants and transcription services, where efficiency and accuracy are paramount. The proposed methods could pave the way for more scalable and effective ASR systems across different languages and domains.
The MDM-ASR framework presents a significant advancement in bridging the gap between accuracy and efficiency in ASR systems. By innovatively applying masked diffusion models and iterative self-correction training, the authors provide a compelling solution that enhances both the performance and practicality of non-autoregressive speech recognition.
The proposed MDM-ASR framework introduces a novel approach to non-autoregressive (NAR) automatic speech recognition (ASR) by leveraging masked diffusion models. The methodology effectively combines a pre-trained speech encoder with a diffusion-based decoder that utilizes iterative self-correction training (ISCT) to align training and inference processes. This approach addresses the limitations of traditional CTC models and autoregressive methods, particularly in terms of decoding efficiency and accuracy. The introduction of position-biased entropy-bounded confidence-based samplers further enhances decoding performance, showcasing a thoughtful integration of existing techniques into a cohesive framework.
The experimental validation is robust, involving multiple benchmark datasets including LibriSpeech, Earnings22, AMI, and VoxPopuli, demonstrating the model's competitive performance against strong autoregressive baselines. The results indicate that MDM-ASR not only achieves lower word error rates (WER) but also maintains superior decoding efficiency, particularly in longer sequences. Comprehensive ablation studies provide insights into the effectiveness of the proposed methods, reinforcing the validity of the experimental design.
While the paper mentions plans to open source the code and hyperparameters, the current lack of a publicly available implementation limits immediate reproducibility. The detailed description of the model architecture and training procedures, however, suggests that other researchers could replicate the study with sufficient effort.
The paper acknowledges limitations regarding the evaluation scope, which is confined to a subset of datasets, potentially limiting the generalizability of the findings. Additionally, the reliance on specific design choices may restrict the model's adaptability to varied ASR applications. Future work is needed to explore alternative architectures and broader dataset evaluations.
The advancements presented in this research have significant implications for real-time ASR applications, particularly in environments where efficiency is critical, such as virtual assistants and transcription services. The ability to achieve competitive accuracy with faster decoding times could enhance user experience and expand the applicability of ASR technologies across diverse domains.
This paper highlights the critical importance of multi-channel speech enhancement (MCSE) for speech emotion recognition (ER) in cocktail party scenarios. A multi-channel speech dereverberation and separation front-end integrating DNN-WPE and mask-based MVDR is used to extract the target speaker's speech from the mixture, before being fed into the downstream ER back-end using HuBERT- and ViT-based speech and visual features. Experiments on mixture speech constructed using the IEMOCAP and MSP-FACE datasets suggest the MCSE output consistently outperforms domain fine-tuned single-channel speech representations produced by: a) Conformer-based metric GANs; and b) WavLM SSL features with optional SE-ER dual task fine-tuning. Statistically significant increases in weighted accuracy, unweighted accuracy and F1 measures of up to 9.5%, 8.5% and 9.1% absolute (17.1%, 14.7% and 16.0% relative) are obtained over the above single-channel baselines. The generalization of IEMOCAP-trained MCSE front-ends is also demonstrated through zero-shot application to out-of-domain MSP-FACE data.
Primary: Institute of Software, Chinese Academy of Sciences
All Institutions: Institute of Software, Chinese Academy of Sciences, National Research Council Canada, The Chinese University of Hong Kong
This paper makes a significant contribution by demonstrating the effectiveness of multi-channel speech enhancement techniques for improving emotion recognition in challenging acoustic environments. The innovative methodology and strong experimental results highlight its potential impact on future research and applications in the field.
The paper presents a robust methodology that integrates multi-channel speech enhancement (MCSE) with emotion recognition (ER) in cocktail party scenarios. The use of a DNN-WPE based dereverberation and mask-based MVDR separation front-end is innovative, particularly in its application to ER, which has traditionally relied on single-channel inputs. The integration of HuBERT and ViT for feature extraction further enhances the approach, making it suitable for both audio-only and audio-visual ER systems. The detailed ablation studies provide insights into the contributions of each component, showcasing a comprehensive understanding of the problem space.
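The mask-based MVDR component can be sketched per frequency bin using the standard reference-channel formulation; this is a textbook simplification, not the paper's full DNN-WPE plus MVDR pipeline:

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_ch=0):
    """Sketch of a mask-based MVDR beamformer for one frequency bin.
    stft: (channels, frames) complex observations; masks: (frames,) in [0, 1],
    typically produced by a DNN. Spatial covariances are mask-weighted outer
    products; the filter is w = Phi_n^{-1} Phi_s e_ref / tr(Phi_n^{-1} Phi_s)."""
    phi_s = (speech_mask * stft) @ stft.conj().T / speech_mask.sum()
    phi_n = (noise_mask * stft) @ stft.conj().T / noise_mask.sum()
    # small diagonal loading keeps the noise covariance invertible
    phi_n = phi_n + 1e-6 * np.trace(phi_n).real / len(phi_n) * np.eye(len(phi_n))
    num = np.linalg.solve(phi_n, phi_s)   # Phi_n^{-1} Phi_s
    w = num[:, ref_ch] / np.trace(num)    # beamformer weights for the ref channel
    return w.conj() @ stft                # enhanced single-channel frames
```

Running this independently per frequency bin yields the enhanced spectrogram that the ER back-end then consumes.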
The experiments are well-structured, utilizing two established datasets (IEMOCAP and MSP-FACE) to evaluate the proposed system's performance. The results demonstrate statistically significant improvements in accuracy and F1 scores compared to single-channel baselines, indicating the effectiveness of the MCSE approach. The zero-shot application of the MCSE front-end to out-of-domain data is particularly noteworthy, suggesting good generalization capabilities.
The paper provides sufficient details regarding the experimental setup, including model configurations and training strategies, which enhances reproducibility. However, the absence of a publicly available code repository may hinder full reproducibility for other researchers.
While the paper addresses a significant gap in the literature, it does not explore the potential computational costs and real-time applicability of the proposed MCSE front-end in practical scenarios. Additionally, the reliance on simulated data for training may limit the model's performance in real-world applications.
The findings of this research have the potential to significantly advance the field of emotion recognition in noisy environments, particularly in applications such as human-computer interaction, assistive technologies, and surveillance systems. The integration of multi-channel processing could lead to more robust systems capable of understanding human emotions in complex auditory scenes.
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the demonstration that self-supervised speech models encode phonologically interpretable and compositional vectors, revealing a structured representation of phonological features. This work significantly advances the understanding of S3M representations and their potential applications in speech technology and linguistics.
The paper presents a novel approach to understanding self-supervised speech models (S3Ms) by investigating the linear structure of phonological features within their representation space. The methodology involves analyzing phonological vectors across 96 languages, establishing a framework for phonological vector arithmetic. The use of cosine similarity to evaluate phonological analogies and the introduction of a vocoder to assess the scaling of phonological vectors are innovative aspects that enhance the understanding of S3M representations.
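The phonological vector arithmetic can be sketched as mean-embedding differences; the data layout and function names below are illustrative assumptions, not the paper's API:

```python
import numpy as np

def feature_vector(reps, phone_a, phone_b):
    """Sketch of extracting a phonological direction from S3M representations:
    the voicing vector, for instance, is the difference of mean embeddings of
    [d] and [t]. reps maps phone labels to (n, d) arrays of pooled frames."""
    return reps[phone_a].mean(axis=0) - reps[phone_b].mean(axis=0)

def apply_feature(reps, phone, vec, scale=1.0):
    """Add a scaled phonological vector to a phone's mean embedding:
    [p] + voicing approximates [b]; intermediate scales trace a continuum."""
    return reps[phone].mean(axis=0) + scale * vec
```

Varying `scale` between 0 and 1 is what produces the graded voicing continuum the abstract describes, once the shifted embedding is resynthesized through a vocoder.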
The experiments are well-structured, utilizing two datasets (TIMIT and VoxAngeles) to validate the hypotheses regarding phonological vector arithmetic and scaling. The results demonstrate a strong correlation between the scale of phonological vectors and acoustic measurements, providing empirical support for the proposed theories. The analysis of phonological features across different languages adds to the robustness of the findings, although the paper could benefit from a broader range of S3Ms to validate the generalizability of the results.
The authors have made their code and interactive demos publicly available, which is a positive aspect for reproducibility. However, the paper could improve by providing more detailed implementation specifics, particularly regarding the training of the vocoder and the exact configurations used for the S3Ms.
The study is limited by its focus on a specific set of phonological features as defined by PanPhon, which may not capture the full complexity of phonological systems across all languages. Additionally, the results are influenced by the choice of vocoder, and the authors acknowledge that different vocoders may yield varying synthesis results. The paper also notes that it does not explore all possible S3Ms, which could limit the generalizability of the findings.
The findings have significant implications for both speech processing and linguistic theory. By demonstrating that S3Ms can learn interpretable phonological structures, the research opens avenues for more intuitive speech synthesis and understanding of phonological features as continuous rather than binary. This could enhance applications in speech recognition, synthesis, and language learning technologies.