Audio ML Papers

Week of February 22 - March 1, 2026

Subcategories: All (23) | Speech Synthesis (3) | Music Synthesis (5) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (1) | ASR (3) | Other (9)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli · arXiv
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited...
#2 TOP PAPER (Score: 84)
Karan Thakkar, Mounya Elhilali · ICASSP 2026
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Pr...
#3 TOP PAPER (Score: 83)
Sifei Li, Yang Li, Zizhou Wang ... · ICLR 2026
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through m...
Saturday, February 28, 2026
Seunghyun Oh, Malek Itani, Aseem Gauri ... · arXiv
Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We int...
Yinghao Ma, Haiwen Xia, Hewei Gao ... · arXiv
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under ...
Sen Zhang, Jianguo Wei, Wenhuan Lu ... · ICASSP 2026
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is pr...
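The excerpt above points at the KV cache as the memory bottleneck: in multi-head attention, the cached keys and values grow linearly with sequence length. A minimal back-of-the-envelope sketch (not the paper's method; the Whisper-large-style dimensions below are illustrative assumptions):

```python
def kv_cache_bytes(n_layers: int, seq_len: int, n_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Memory for cached keys AND values across all layers.

    Per layer and per token, attention caches one key vector and one
    value vector of size n_heads * head_dim each -- hence the factor 2.
    """
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

# Illustrative Whisper-large-like decoder config (assumed, not from the paper):
# 32 layers, 20 heads, head_dim 64, fp16 (2 bytes), 1500 cached positions.
mem = kv_cache_bytes(n_layers=32, seq_len=1500, n_heads=20,
                     head_dim=64, dtype_bytes=2)
print(f"{mem / 1e6:.1f} MB")  # grows linearly in seq_len
```

Doubling `seq_len` doubles the cache, which is why long-audio ASR runs into GPU memory pressure and why KV-cache compression is an active target.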
Jinhan Xu, Xing Tang, Houpeng Yang ... · arXiv
Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Though recent diffusion-based models produce high quality generations, they tend to s...
Friday, February 27, 2026
Heinrich Dinkel, Xingwei Sun, Gang Li ... · arXiv
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this ...
Keita Goto, Takashi Maekaku, Jin Sakuma ... · ICASSP 2026
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to eac...
Thursday, February 26, 2026
Zeyu Xie, Chenxing Li, Qiao Jin ... · arXiv
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discrimi...
Trung Dang, Sharath Rao, Ananya Gupta ... · arXiv
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are sig...
Sanjid Hasan, Risalat Labib, A H M Fuad ... · arXiv
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, w...
Wednesday, February 25, 2026
Songjun Cao, Yuqi Li, Yunpeng Luo ... · arXiv
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains...
Yuzhu Wang, Archontis Politis, Konstantinos Drossos ... · IEEE Transactions on Audio, Speech and Language Processing
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both featu...
Yuxuan Chen, Peize He, Haoyuan Xu ... · arXiv
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task tr...
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen ... · LREC 2026
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien...
Tuesday, February 24, 2026
Townim Faisal Chowdhury, Ta Duc Huy, Siqi Pan ... · ICASSP
Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated conce...
Monday, February 23, 2026
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli · arXiv
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited...
Karan Thakkar, Mounya Elhilali · ICASSP 2026
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Pr...
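The excerpt above describes envelope reconstruction from EEG. The standard baseline for this task (a backward model, not this paper's contribution) is ridge regression from time-lagged EEG channels to the audio envelope. A minimal sketch on synthetic data; channel counts, lag window, and regularization strength are illustrative assumptions:

```python
import numpy as np

def lagged_design(eeg: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack time-lagged copies of the EEG: (T, C) -> (T, C * n_lags)."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic demo: an "envelope" that really is a lagged linear mix of the EEG.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((2000, 8))          # 2000 samples, 8 channels (assumed)
X = lagged_design(eeg, n_lags=5)
w_true = rng.standard_normal(X.shape[1])
envelope = X @ w_true + 0.1 * rng.standard_normal(2000)

w = ridge_fit(X, envelope, lam=1e-2)
r = np.corrcoef(X @ w, envelope)[0, 1]        # reconstruction accuracy
```

Real EEG is far noisier than this toy setup; reported envelope-reconstruction correlations are typically modest, which is the fidelity gap the excerpt alludes to.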
Hanwen Liu, Saierdaer Yusuyin, Hao Huang ... · INTERSPEECH 2026
Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text--speech alignment and well-designed training sequences that bala...
Sifei Li, Yang Li, Zizhou Wang ... · ICLR 2026
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through m...
Yue Pan, Xingyao Wang, Hanyue Zhang ... · arXiv
Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional c...
Yungang Yi · arXiv
Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events. However, practical composition and performance workflows frequently rely on resource-limited devices (e.g., electronic instru...
Nghia Phan, Rong Jin, Gang Liu ... · arXiv
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we prese...
Sunday, February 22, 2026
Qibing Bai, Shuhao Shi, Shuai Wang ... · ICASSP 2026
Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generat...