The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.
Primary: Tsinghua University
All Institutions: Tsinghua University, ByteDance China, Department of Psychological and Cognitive Sciences, School of Information Science and Technology, ShanghaiTech University
The main contribution of this paper is the introduction of audio-interleaved reasoning, which significantly enhances the audio comprehension capabilities of LALMs by allowing them to engage with audio data dynamically during reasoning tasks. This innovative approach, combined with a robust training framework and comprehensive evaluation, positions the work as a significant advancement in the field of audio machine learning.
The paper introduces a novel approach called audio-interleaved reasoning, which allows Large Audio Language Models (LALMs) to actively engage with audio data during reasoning tasks. This is achieved through a two-stage training framework that combines supervised fine-tuning and reinforcement learning, enabling the model to dynamically re-listen to salient audio segments. The methodology is well-structured, leveraging human cognitive processes as inspiration, and includes a comprehensive data generation pipeline that produces high-quality training data. The approach is innovative in its treatment of audio as an active component rather than a static context, which is a significant departure from existing methods.
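To make the re-listening mechanism concrete, the sketch below shows one way an audio-interleaved decoding loop could be organized: the model reasons in text until it emits a re-listen request with a time range, the referenced segment is cropped and re-encoded, and decoding resumes with the fresh audio features in context. The tag syntax, callable names, and round limit are illustrative assumptions, not Echo's actual interface.

```python
import re
from typing import Callable, List

# Hypothetical tag format; the actual re-listen syntax used by Echo may differ.
LISTEN_TAG = re.compile(r"<listen\s+start=([\d.]+)\s+end=([\d.]+)\s*/>")

def audio_interleaved_decode(
    generate_step: Callable[[List[dict]], str],      # decode text until a tag or EOS
    encode_segment: Callable[[float, float], dict],  # crop + re-encode an audio span
    question: str,
    max_rounds: int = 4,
) -> str:
    """Interleave text reasoning with on-demand re-listening to audio segments."""
    context: List[dict] = [{"type": "text", "content": question}]
    answer = ""
    for _ in range(max_rounds):
        chunk = generate_step(context)      # reason in text until a re-listen request
        answer += chunk
        match = LISTEN_TAG.search(chunk)
        if match is None:                   # no request: the reasoning chain is complete
            break
        start, end = float(match.group(1)), float(match.group(2))
        context.append({"type": "text", "content": chunk})
        context.append(encode_segment(start, end))  # fresh features for the salient span
    return answer
```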
The experiments are rigorously designed, utilizing multiple audio comprehension benchmarks to validate the effectiveness of the proposed methodology. The results demonstrate that Echo outperforms existing LALMs, including advanced proprietary models, in both expert-level and general-purpose tasks. The paper provides detailed comparisons and analyses, showcasing the advantages of the audio-interleaved reasoning format over traditional methods. The evaluation metrics are appropriate, and the results are statistically significant, reinforcing the claims made by the authors.
The paper includes a detailed description of the training framework, data generation pipeline, and evaluation settings, which supports reproducibility. The authors express a commitment to releasing the complete code and dataset in the future, which is crucial for enabling further research and validation of their findings.
While the proposed method shows promise, the authors acknowledge that the implementation remains relatively straightforward and that there is room for refinement. The current approach may not fully exploit the potential of audio re-listening, and the automated generation of chain-of-thought (CoT) annotations lacks human heuristics, which could lead to biases in the training data. Additionally, the reliance on existing datasets may limit the generalizability of the findings.
The advancements in audio comprehension capabilities have significant implications for various applications, including human-computer interaction, accessibility technologies, and educational tools. By improving how machines understand and reason about audio, this research could lead to more intuitive and effective systems that better mimic human cognitive processes. The potential for future research in this area is substantial, particularly in enhancing the interaction between audio and other modalities.
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. This benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even the state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.
Primary: National University of Singapore
All Institutions: National University of Singapore, The Chinese University of Hong Kong, Tianjin University
The main contribution of this paper is the introduction of AudioRAG, a benchmark for evaluating audio reasoning in conjunction with information retrieval, alongside the development of an agentic pipeline that improves performance on this benchmark. This work significantly advances the understanding of Large Audio-Language Models' limitations and proposes a novel approach to enhance their reasoning capabilities through external knowledge integration.
The methodology is well-structured, introducing AudioRAG as a benchmark that combines audio reasoning with information retrieval. The authors employ both LLM-generated and manually curated questions, which is a thoughtful approach to ensure diversity and relevance in the dataset. The use of an agentic pipeline that integrates audio processing and retrieval-augmented generation is innovative and addresses the limitations of existing LALMs. However, the paper could benefit from more detailed descriptions of the audio processing tool and its integration with the reasoning LLM, as well as clearer explanations of the filtering process for question validity and answer correctness.
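For intuition, a minimal agent loop of the kind described might look as follows: the clip is first turned into a textual description by an audio tool, after which a text-only reasoning model alternates between issuing search queries and committing to an answer. The callables, prompt format, and hop limit are hypothetical placeholders rather than the authors' implementation.

```python
from typing import Callable, List

def audio_rag_answer(
    audio_path: str,
    question: str,
    audio_tool: Callable[[str], str],    # e.g. transcription or captioning of the clip
    search: Callable[[str], List[str]],  # web retrieval returning text snippets
    llm: Callable[[str], str],           # text-only reasoning model
    max_hops: int = 3,
) -> str:
    """Ground audio reasoning with retrieved web evidence (schematic agent loop)."""
    audio_context = audio_tool(audio_path)   # convert the audio into text first
    evidence: List[str] = []
    for _ in range(max_hops):
        prompt = (
            f"Audio description: {audio_context}\n"
            f"Question: {question}\n"
            f"Evidence so far: {evidence}\n"
            "Reply with either SEARCH: <query> or ANSWER: <final answer>."
        )
        step = llm(prompt).strip()
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        if step.startswith("SEARCH:"):
            evidence.extend(search(step[len("SEARCH:"):].strip()))  # accumulate snippets
    return llm(f"Answer using evidence {evidence}: {question}")
```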
The experimental evaluation is thorough, assessing multiple state-of-the-art LALMs against the AudioRAG benchmark. The results clearly demonstrate the challenges faced by current models, highlighting the need for improved reasoning capabilities. The comparison between raw models and the agentic pipeline provides compelling evidence of the pipeline's effectiveness. However, the paper lacks detailed statistical analyses and visualizations that could further substantiate the findings.
The paper provides a GitHub repository link for the dataset, which is a positive step towards reproducibility. However, it lacks detailed implementation instructions for the agentic pipeline and the specific configurations used in experiments. This could hinder other researchers from replicating the results accurately.
One limitation is the reliance on LLMs for generating questions and answers, which may introduce biases or inaccuracies inherent in the models. Additionally, the benchmark's scope may not cover all real-world scenarios, potentially limiting its applicability. The increase in invalid answers from the agentic pipeline suggests that the complexity of multi-hop reasoning may lead to logical errors.
The proposed benchmark and agentic pipeline have significant implications for enhancing audio-based reasoning systems. By addressing the challenges of integrating external knowledge with audio processing, this work could lead to more robust applications in various fields, including education, entertainment, and information retrieval systems.
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
Primary: Soul-AI Lab
All Institutions: Soul-AI Lab
SoulX-Singer represents a significant advancement in zero-shot singing voice synthesis, combining a large-scale dataset with innovative modeling techniques to achieve high-quality, flexible vocal generation across multiple languages. The comprehensive evaluation and robust methodology position this work as a valuable contribution to the field of machine learning and audio synthesis.
The methodology of SoulX-Singer is robust, leveraging a large-scale dataset of over 42,000 hours of vocal recordings to enhance zero-shot generalization capabilities. The dual-control mechanism (melody-control and score-control modes) is innovative, allowing for flexible synthesis based on different input types. The data processing pipeline is well-structured, ensuring high-quality vocal extraction and annotation, which is crucial for training effective models. The use of flow matching and a dedicated Singing Content Encoder to manage multimodal inputs is a significant advancement in the field.
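As a rough illustration of the generative objective, the snippet below sketches a standard conditional flow-matching loss in PyTorch: the network regresses the constant velocity along a linear noise-to-target path, given conditioning from the content encoder (lyrics plus MIDI or melody features). The model signature and the linear path are generic assumptions; SoulX-Singer's exact formulation may differ.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Generic conditional flow-matching objective (not the paper's exact recipe).

    x1:   target vocal latents/features, shape (B, T, D)
    cond: conditioning from the Singing Content Encoder (assumed interface)
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # noise endpoint of the path
    t = torch.rand(b, 1, 1, device=x1.device)  # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1               # linear interpolation between noise and data
    v_target = x1 - x0                         # constant velocity along that path
    v_pred = model(xt, t.view(b), cond)        # assumed signature: model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)
```

At inference, the learned velocity field is integrated from noise to a clean latent in a small number of ODE steps, which is part of what makes this family of models attractive for practical deployment.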
The experimental evaluation is thorough, utilizing two distinct benchmarks (GMO-SVS and SoulX-Singer-Eval) to assess performance across multiple dimensions, including melodic accuracy, intelligibility, and overall singing quality. The results consistently demonstrate that SoulX-Singer outperforms existing state-of-the-art models, showcasing its effectiveness in both controlled and zero-shot scenarios. The comprehensive metrics used for evaluation provide a clear picture of the model's capabilities.
The paper provides sufficient detail regarding the architecture, training process, and evaluation metrics, which supports reproducibility. The availability of the dataset and code on GitHub further enhances the potential for other researchers to replicate the study. However, the reliance on specific pretrained models for vocal extraction and transcription may pose some challenges in reproducing the exact results without access to those models.
One limitation of the study is the potential for voice impersonation and ethical concerns associated with the use of synthesized voices, which the authors acknowledge. Additionally, while the model shows strong performance across multiple languages, the dataset's composition may still limit its generalization to other languages or dialects not represented in the training data.
SoulX-Singer has significant implications for the music production industry, enabling creators to synthesize high-quality singing voices without the need for extensive vocal recordings. This technology could democratize music creation, allowing individuals without access to professional singers to produce high-quality vocal tracks. However, the ethical considerations surrounding voice synthesis and potential misuse must be addressed to ensure responsible deployment.
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject and are limited by the spatial sampling theorem, resulting in significant performance degradation under sparse sampling. Recent learning-based methods alleviate this limitation by leveraging cross-subject information, yet most existing neural architectures primarily focus on modeling spatial relationships across directions, while spectral dependencies along the frequency dimension are often modeled implicitly or treated independently. However, HRTF magnitude responses exhibit strong local continuity and long-range structure in the frequency domain, which are not fully exploited. This work investigates frequency-domain feature modeling by examining how different architectural choices, ranging from per-frequency multilayer perceptrons to convolutional, dilated convolutional, and attention-based models, affect performance under varying sparsity levels. The results show that explicit spectral modeling consistently improves reconstruction accuracy, particularly under severe sparsity. Motivated by this observation, a frequency-domain Conformer-based architecture is adopted to jointly capture local spectral continuity and long-range frequency correlations. Experimental results on the SONICOM and HUTUBS datasets demonstrate that the proposed method achieves state-of-the-art performance in terms of interaural level difference and log-spectral distortion.
Primary: University of Technology Sydney
All Institutions: University of Technology Sydney, Monash University
This paper makes a substantial contribution to the field of audio processing by introducing a frequency-domain modeling approach for HRTF magnitude upsampling, demonstrating its effectiveness through rigorous experimentation and analysis. The findings highlight the importance of architectural choices in modeling spectral features, paving the way for future innovations in personalized audio rendering.
The paper proposes a novel approach to HRTF magnitude upsampling through frequency-domain feature modeling. It critically examines various architectural choices, including per-frequency MLPs, convolutional models, and a Conformer-based architecture, to effectively capture both local spectral continuity and long-range frequency correlations. The methodology is well-structured, with a clear separation between spatial mapping and frequency-domain modeling, which allows for a comprehensive exploration of the design space. The integration of spectral gradient loss alongside log-spectral distortion as a training objective is a thoughtful addition that enhances the model's ability to preserve spectral features.
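A minimal sketch of such a training objective is given below, combining log-spectral distortion with a first-difference spectral gradient term along the frequency axis; the weighting and exact definitions are illustrative assumptions rather than the paper's reported configuration.

```python
import torch

def lsd_loss(pred_mag, true_mag, eps=1e-8):
    """Log-spectral distortion between predicted and reference HRTF magnitudes (B, D, F)."""
    diff = 20.0 * (torch.log10(pred_mag + eps) - torch.log10(true_mag + eps))
    return torch.sqrt(torch.mean(diff ** 2, dim=-1)).mean()

def spectral_gradient_loss(pred_mag, true_mag, eps=1e-8):
    """Penalize mismatched slopes along frequency to preserve local spectral structure."""
    pred_log = 20.0 * torch.log10(pred_mag + eps)
    true_log = 20.0 * torch.log10(true_mag + eps)
    pred_grad = pred_log[..., 1:] - pred_log[..., :-1]  # first difference over frequency bins
    true_grad = true_log[..., 1:] - true_log[..., :-1]
    return torch.mean(torch.abs(pred_grad - true_grad))

def total_loss(pred_mag, true_mag, lam=0.5):
    # lam is a placeholder weight, not a value reported in the paper
    return lsd_loss(pred_mag, true_mag) + lam * spectral_gradient_loss(pred_mag, true_mag)
```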
The experiments are robust, utilizing two well-established datasets (SONICOM and HUTUBS) to evaluate the proposed method's performance under varying sparsity levels. The results demonstrate that the FD-Conformer consistently outperforms existing methods in terms of interaural level difference (ILD) and log-spectral distortion (LSD), particularly in sparse measurement scenarios. The ablation studies provide valuable insights into the contributions of different components of the architecture, reinforcing the importance of frequency-domain modeling.
The paper includes sufficient details regarding the experimental setup, including the datasets used, preprocessing steps, model architecture, and training protocols. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings.
While the proposed method shows significant improvements, it may still be sensitive to the choice of hyperparameters and the specific configurations of the datasets used. Additionally, the performance in extremely sparse scenarios, while improved, may still not meet practical requirements for all applications, indicating a potential area for further research.
The advancements in HRTF upsampling have significant implications for personalized spatial audio rendering, which is increasingly relevant in virtual reality, gaming, and immersive audio applications. By improving the accuracy of HRTF estimations from sparse measurements, this research could enhance user experiences in various audio applications, making spatial audio more accessible and effective.
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
Primary: unknown
All Institutions: unknown
The paper presents SLD-L2S, a novel framework for high-fidelity lip-to-speech synthesis that leverages a hierarchical subspace latent diffusion model, achieving state-of-the-art results in synthesis quality. The methodology is innovative and addresses critical challenges in the field, while the experimental evaluation supports its effectiveness, though the lack of a publicly available implementation may hinder reproducibility.
The paper introduces a novel framework, SLD-L2S, which employs a hierarchical subspace latent diffusion model to directly map visual lip movements to the latent space of a pre-trained audio codec. The methodology is innovative in its use of diffusion convolution blocks (DiCB) and a reparameterized flow matching technique, which enhances the model's ability to generate high-fidelity speech without relying on traditional intermediate representations like mel-spectrograms. The hierarchical architecture and subspace decomposition approach are well-justified, addressing the inherent challenges of lip-to-speech synthesis effectively.
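The subspace idea can be pictured with a small PyTorch module that splits the latent channels into parallel groups, processes each group independently, and then mixes them back together. The block below is a schematic stand-in, not the paper's diffusion convolution block (DiCB); the layer choices and subspace count are assumptions.

```python
import torch
import torch.nn as nn

class SubspaceDecomposition(nn.Module):
    """Illustrative split of a latent sequence into parallel subspaces."""

    def __init__(self, dim: int, num_subspaces: int = 4):
        super().__init__()
        assert dim % num_subspaces == 0
        self.num_subspaces = num_subspaces
        sub_dim = dim // num_subspaces
        # One lightweight processor per subspace (intra-subspace interaction)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv1d(sub_dim, sub_dim, 3, padding=1), nn.GELU())
             for _ in range(num_subspaces)]
        )
        self.merge = nn.Conv1d(dim, dim, 1)  # inter-subspace interaction after the split

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, T) latent sequence in the audio codec's continuous space
        chunks = torch.chunk(x, self.num_subspaces, dim=1)
        processed = [blk(c) for blk, c in zip(self.blocks, chunks)]
        return self.merge(torch.cat(processed, dim=1))
```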
The experiments are robust, utilizing multiple benchmark datasets (LRS3-TED and LRS2-BBC) to validate the performance of the proposed method. The results demonstrate that SLD-L2S achieves state-of-the-art performance in both objective and subjective evaluations, significantly outperforming existing methods. The use of comprehensive metrics, including UTMOS, SCOREQ, WER, and subjective MOS tests, provides a well-rounded assessment of the model's capabilities.
The paper provides detailed implementation details, including architecture configurations, training procedures, and hyperparameter settings, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the results.
One notable limitation is the lack of a clear discussion on the potential computational costs associated with the proposed method, particularly in real-world applications. Additionally, the paper does not address the scalability of the model to different languages or accents, which could impact its generalizability.
The proposed SLD-L2S framework has significant implications for various applications, including automated video dubbing, assistive technologies for individuals with speech impairments, and enhancing communication in noisy environments. By improving the quality and intelligibility of synthesized speech from visual inputs, this work could facilitate more natural interactions in human-computer interfaces.
Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.
Primary: University of California
All Institutions: University of California, The University of Queensland
The main contribution of this paper is the introduction of AudioRouter, a reinforcement learning framework that enhances audio understanding in large audio language models by optimizing tool usage while significantly reducing the amount of required training data. This innovative approach not only improves performance but also offers a scalable alternative to traditional data-intensive training methods, marking a significant advancement in the field of audio processing and reasoning.
The methodology presented in the paper is innovative as it decouples tool usage from the reasoning model, allowing for a more efficient learning process. The use of reinforcement learning to optimize a routing policy for tool invocation is a significant departure from traditional end-to-end training approaches. The authors effectively formulate tool usage as a discrete decision-making problem, which is a novel perspective in the context of audio language models. The decision to keep the reasoning model frozen while training the router is a strategic choice that enhances data efficiency and reduces complexity.
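Conceptually, the routing policy can be a small network over a pooled audio-question representation, trained with a policy-gradient update while the reasoning model stays frozen. The sketch below uses a plain REINFORCE step with a mean baseline; the feature extraction, action space, and reward definition are assumptions for illustration rather than the paper's exact algorithm.

```python
import torch
import torch.nn as nn

class ToolRouter(nn.Module):
    """Lightweight routing policy over a fixed tool set; the reasoning LALM stays frozen."""

    def __init__(self, feat_dim: int, num_tools: int):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_tools)
        )

    def forward(self, features: torch.Tensor) -> torch.distributions.Categorical:
        # features: pooled audio/question representation, shape (B, feat_dim)
        return torch.distributions.Categorical(logits=self.policy(features))

def reinforce_step(router, optimizer, features, rewards_fn):
    """One policy-gradient update; rewards_fn scores the frozen LALM's answers."""
    dist = router(features)
    actions = dist.sample()        # which tool to call (a no-tool action can be one choice)
    rewards = rewards_fn(actions)  # e.g. answer correctness given the chosen tools, shape (B,)
    baseline = rewards.mean()      # simple variance-reduction baseline
    loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the router's parameters receive gradients, far less data is needed to learn tool usage than end-to-end fine-tuning of the reasoning model would require, which is consistent with the data efficiency reported above.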
The experimental evaluation is robust, demonstrating the effectiveness of AudioRouter across multiple benchmarks (MMAU-mini and MMAR). The results indicate substantial improvements in performance while requiring significantly less training data compared to conventional methods. The paper provides clear comparisons against baseline models, showcasing the advantages of the proposed framework. However, the experiments could benefit from a broader range of datasets and tasks to further validate the generalizability of the approach.
The paper includes sufficient details regarding the experimental setup, including model architectures, training data, and reinforcement learning specifics. However, the lack of URLs for code or project repositories limits the reproducibility of the results. Providing access to the trained models or implementation would enhance the ability of other researchers to replicate the findings.
The paper acknowledges that the relative outcome reward relies on a fixed reasoning model, which may limit the Router's learning signal. Additionally, the focus on short-form, closed-set audio reasoning tasks with a limited set of audio tools may restrict the applicability of the findings. Future work should explore extending the framework to more complex reasoning tasks and diverse tool capabilities.
The proposed AudioRouter framework has the potential to significantly advance the field of audio understanding by providing a more data-efficient method for leveraging external tools. This approach could lead to broader applications in various domains, including audio analysis, multimedia processing, and interactive AI systems. By reducing the reliance on large annotated datasets, it may also democratize access to advanced audio processing capabilities.
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
Primary: Fudan University
All Institutions: Fudan University, MOSI Intelligence, Shanghai Innovation Institute
The paper presents MOSS-Audio-Tokenizer, a novel end-to-end audio tokenizer that significantly improves audio processing capabilities for autoregressive models. Its comprehensive methodology and robust experimental validation establish it as a noteworthy contribution to the field of machine learning and audio processing.
The paper introduces the Causal Audio Tokenizer (CAT), a novel architecture that employs a fully end-to-end approach to audio tokenization using a homogeneous stack of causal Transformer blocks. This design minimizes fixed inductive biases, allowing for high-fidelity audio reconstruction across diverse domains. The architecture's simplicity and scalability are emphasized, with joint optimization of the encoder, quantizer, decoder, and discriminator, which is a significant departure from existing methods that often rely on pretrained components or complex architectures. The methodology is well-structured, with clear explanations of the training objectives and the integration of semantic modeling through audio-to-text tasks.
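To ground the description, the following sketch shows a drastically simplified causal Transformer tokenizer: raw audio is folded into frames, encoded with a causal attention mask, passed through a discrete bottleneck, and decoded back to samples. The frame size, layer counts, and quantizer interface are illustrative assumptions, and the model here is orders of magnitude smaller than the 1.6B-parameter system described.

```python
import torch
import torch.nn as nn

def causal_mask(t: int, device) -> torch.Tensor:
    return torch.triu(torch.full((t, t), float("-inf"), device=device), diagonal=1)

class CausalTokenizerSketch(nn.Module):
    """Schematic encoder-quantizer-decoder stack built from causal Transformer blocks."""

    def __init__(self, dim=512, layers=4, heads=8, frame=320):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.frame = frame
        self.patchify = nn.Linear(frame, dim)                # fold raw samples into frame embeddings
        self.encoder = nn.TransformerEncoder(block, layers)
        self.decoder = nn.TransformerEncoder(block, layers)  # decoder mirrors the encoder config
        self.unpatchify = nn.Linear(dim, frame)

    def forward(self, wav: torch.Tensor, quantize):
        # wav: (B, T); quantize: callable mapping latents -> (codes, dequantized latents)
        frames = wav.unfold(-1, self.frame, self.frame)      # (B, N, frame) non-overlapping frames
        mask = causal_mask(frames.shape[1], wav.device)
        z = self.encoder(self.patchify(frames), mask=mask)
        codes, z_q = quantize(z)                             # discrete bottleneck (e.g. RVQ)
        recon = self.unpatchify(self.decoder(z_q, mask=mask))
        return codes, recon.reshape(wav.shape[0], -1)
```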
The authors conduct extensive experiments to evaluate the performance of MOSS-Audio-Tokenizer against existing audio tokenizers across various bitrate regimes. The results demonstrate state-of-the-art reconstruction quality in speech, sound, and music, with a clear advantage in low-bitrate scenarios. The use of both objective and subjective evaluation metrics strengthens the findings, providing a comprehensive assessment of the model's capabilities. The experiments are well-designed, showcasing the effectiveness of the proposed Progressive Sequence Dropout training strategy and the model's robustness across different conditions.
The paper provides detailed implementation information, including architecture specifications, training schedules, and optimization strategies. However, it lacks a publicly accessible code repository or demo URL, which could hinder reproducibility. The absence of shared code or datasets limits the ability for other researchers to validate the findings independently.
While the paper presents a strong technical contribution, it does not sufficiently address potential limitations, such as the computational resources required for training the large-scale model and the generalizability of the results to real-world applications. Additionally, the reliance on a large dataset for training may not be feasible for all researchers.
The development of MOSS-Audio-Tokenizer has significant implications for the field of audio processing and generation, particularly in enhancing the capabilities of autoregressive models. Its ability to provide high-fidelity audio reconstruction and support various downstream tasks like text-to-speech and automatic speech recognition positions it as a valuable tool for future audio foundation models. The research could lead to advancements in applications such as virtual assistants, content creation, and accessibility technologies.
The paper introduces MOSS-Audio-Tokenizer, a scalable and effective audio tokenizer that leverages a fully end-to-end Transformer architecture to achieve high-fidelity audio reconstruction and competitive performance in downstream tasks. This work represents a significant advancement in audio processing methodologies, emphasizing the importance of simplicity and scalability in model design.
The paper presents a novel architecture, MOSS-Audio-Tokenizer, built on the Causal Audio Tokenizer (CAT) framework, which utilizes a purely Transformer-based approach for audio tokenization. This end-to-end model optimizes the encoder, quantizer, and decoder jointly, which is a significant departure from existing methods that rely on pretrained encoders or complex architectures. The use of residual vector quantization and a multi-task learning strategy to align audio representations with text further enhances the methodology. The design principles emphasize simplicity, scalability, and causality, making it suitable for autoregressive modeling.
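The residual vector quantization mentioned above can be illustrated with a minimal implementation in which each stage quantizes the residual left by the previous stage, and a straight-through estimator passes gradients back to the encoder. Codebook sizes and the nearest-neighbour search below are generic choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal residual vector quantizer: each stage quantizes the previous stage's residual."""

    def __init__(self, dim=512, codebook_size=1024, num_stages=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_stages)]
        )

    def forward(self, z: torch.Tensor):
        # z: (B, T, D) continuous latents from the encoder
        residual, quantized, codes = z, torch.zeros_like(z), []
        for codebook in self.codebooks:
            # Squared Euclidean distance to every code: (B, T, codebook_size)
            dists = (residual.unsqueeze(2) - codebook.weight).pow(2).sum(-1)
            idx = dists.argmin(dim=-1)      # nearest code per frame
            chosen = codebook(idx)
            quantized = quantized + chosen
            residual = residual - chosen    # the next stage refines what is left over
            codes.append(idx)
        # Straight-through estimator so gradients flow to the encoder
        quantized = z + (quantized - z).detach()
        return torch.stack(codes, dim=-1), quantized
```

A convenient property of residual schemes is that dropping later stages at inference time lowers the bitrate gracefully, which fits naturally with the wide range of bitrates evaluated here.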
The experiments are comprehensive, evaluating the model across various audio domains including speech, sound, and music. The authors provide both objective and subjective metrics for reconstruction quality, demonstrating that MOSS-Audio-Tokenizer consistently outperforms existing codecs across different bitrates. The results indicate a clear advantage in reconstruction fidelity and robustness, particularly in low-bitrate scenarios, showcasing the effectiveness of the proposed architecture.
The paper includes detailed implementation specifics, including architecture configurations, training schedules, and optimization strategies. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. The authors do mention training on a substantial dataset (3 million hours of audio), but without access to the code or data, independent verification of results could be challenging.
One limitation is the reliance on a large-scale dataset for training, which may not be readily available to all researchers. Additionally, while the model shows strong performance across various tasks, the scalability of the architecture in real-world applications and its performance in edge cases or less common audio types remains to be fully explored.
The MOSS-Audio-Tokenizer has the potential to significantly advance the field of audio processing by providing a unified framework for audio generation and understanding. Its applications could extend to various domains including speech synthesis, automatic speech recognition, and audio content generation, making it a valuable tool for both researchers and practitioners in the field.
Standardized laboratory characterizations for absorbing materials rely on idealized sound field assumptions, which deviate substantially from real-life conditions. Consequently, in-situ acoustic characterization has become essential for accurate diagnosis and virtual prototyping. We propose a physics-informed neural field that reconstructs local, near-surface broadband sound fields from sparse pressure samples to directly infer complex surface impedance. A parallel, multi-frequency architecture enables broadband impedance retrieval within runtimes on the order of seconds to minutes. To validate the method, we developed a compact microphone array with low hardware complexity. Numerical verifications and laboratory experiments demonstrate accurate impedance retrieval with a small number of sensors under realistic conditions. We further showcase the approach in a vehicle cabin to provide practical guidance on measurement locations that avoid strong interference. Here, we show that this approach offers a robust means of characterizing in-situ boundary conditions for architectural and automotive acoustics.
Primary: Technical University of Denmark
All Institutions: Technical University of Denmark
The main contribution of this paper is the development of a physics-informed neural network framework for rapid in-situ characterization of surface impedance from sparse acoustic data, which significantly advances the state-of-the-art in acoustic material characterization. The methodology combines innovative neural network architecture with practical experimental validation, addressing critical challenges in the field of acoustics.
The paper introduces a novel physics-informed neural network architecture for inferring surface impedance from sparse acoustic data, which is a significant advancement over traditional methods that rely on dense sensor arrays and idealized conditions. The use of a parallel multi-frequency architecture allows for efficient processing and inference, addressing computational bottlenecks associated with broadband sound field reconstruction. The methodology is well-structured, incorporating automatic differentiation to infer particle velocity, and employs a composite loss function that integrates data fidelity, physical constraints, and regularization terms, which enhances the robustness of the model.
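A schematic version of such a composite objective is sketched below: a data-fidelity term at the sparse microphone positions, a Helmholtz residual enforced at collocation points via automatic differentiation, and a simple weight regularizer. The loss weights, the real/imaginary output convention, and the omission of the boundary-impedance term are simplifications of the paper's formulation; particle velocity would likewise follow from the pressure gradient through Euler's equation.

```python
import torch

C0 = 343.0  # speed of sound in air (m/s)

def laplacian(scalar, coords):
    """Sum of second derivatives of a scalar field w.r.t. 3-D coordinates via autograd."""
    grad = torch.autograd.grad(scalar.sum(), coords, create_graph=True)[0]
    return sum(
        torch.autograd.grad(grad[:, i].sum(), coords, create_graph=True)[0][:, i]
        for i in range(coords.shape[1])
    )

def composite_pinn_loss(net, x_mic, p_meas, x_col, omega, lam_pde=1.0, lam_reg=1e-4):
    """Illustrative composite loss; lam_pde and lam_reg are placeholders, not the paper's values.

    net maps coordinates (N, 3) to pressure at angular frequency omega as (N, 2) real/imag parts.
    """
    # Data fidelity at the measured microphone positions
    p_pred = net(x_mic)
    target = torch.stack([p_meas.real, p_meas.imag], dim=-1)
    loss_data = torch.mean((p_pred - target) ** 2)

    # Physics term: Helmholtz residual  lap(p) + k^2 p = 0  at collocation points
    x = x_col.detach().clone().requires_grad_(True)
    p = net(x)
    k = omega / C0
    loss_pde = sum(
        ((laplacian(p[:, i], x) + (k ** 2) * p[:, i]) ** 2).mean() for i in range(2)
    )

    # Regularization to stabilize training with few sensors
    loss_reg = sum(w.pow(2).sum() for w in net.parameters())
    return loss_data + lam_pde * loss_pde + lam_reg * loss_reg
```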
The experimental validation is thorough, encompassing both numerical simulations and laboratory experiments in anechoic and reverberant environments. The results demonstrate the framework's capability to accurately retrieve impedance under realistic conditions, showcasing its practical applicability in complex acoustic environments such as vehicle cabins. The sensitivity analysis and parametric sweeps provide valuable insights into the performance of the proposed microphone array configurations, further reinforcing the robustness of the method.
The paper provides detailed descriptions of the experimental setups, training protocols, and evaluation metrics, which facilitate reproducibility. However, the lack of publicly available code and data at this stage may hinder independent validation of the results. The authors mention plans to establish a public repository upon acceptance, which would enhance reproducibility.
One limitation noted is the sensitivity of the method to local sound field complexity, particularly in the presence of strong nodal lines and reflections, which can degrade inference accuracy. Additionally, the reliance on specific microphone configurations may limit the generalizability of the findings to other setups or environments. The paper also acknowledges the challenges posed by measurement noise, especially in the context of near-rigid surfaces.
The proposed framework has significant implications for in-situ acoustic characterization in various fields, including architectural acoustics and automotive design. By enabling rapid and accurate impedance retrieval, this method can improve the design and optimization of sound-absorbing materials and structures, ultimately enhancing acoustic performance in real-world applications. The integration of machine learning with physics-informed approaches represents a promising direction for future research in acoustic engineering.
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
Primary: Institut national de la recherche scientifique (INRS - EMT)
All Institutions: Institut national de la recherche scientifique (INRS - EMT)
The main contribution of this paper is the introduction of BioME, a resource-efficient audio encoder designed for bioacoustic applications, which achieves state-of-the-art performance while significantly reducing computational costs. This work represents a meaningful advancement in the field of audio representation learning, particularly in the context of ecological monitoring, and demonstrates the potential of integrating traditional signal processing techniques with modern deep learning approaches.
The methodology presented in this paper is robust and innovative, leveraging layer-to-layer knowledge distillation to create a compact audio encoder, BioME, that retains high performance on bioacoustic tasks. The integration of modulation-aware features via FiLM conditioning is particularly noteworthy, as it introduces a novel inductive bias that enhances feature disentanglement, which is crucial for effective audio representation in resource-constrained environments. The use of a multi-domain pretraining strategy further strengthens the model's generalization capabilities across diverse bioacoustic tasks.
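The FiLM conditioning referred to above amounts to predicting per-channel scale and shift parameters from modulation-aware descriptors and applying them to intermediate features. Which layers are conditioned and how the descriptors are computed are the paper's design choices, so the module below is only a generic sketch.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift features given a conditioning vector."""

    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, feat_dim)
        self.to_beta = nn.Linear(cond_dim, feat_dim)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (B, T, feat_dim) encoder activations
        # cond:     (B, cond_dim) modulation-aware descriptors (e.g. DSP-derived features)
        gamma = self.to_gamma(cond).unsqueeze(1)  # broadcast the scale over time
        beta = self.to_beta(cond).unsqueeze(1)    # broadcast the shift over time
        return gamma * features + beta
```

Because the conditioning adds only two small linear layers per conditioned block, the inductive bias comes at negligible parameter cost, which matters in the low-capacity regime targeted here.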
The experimental evaluation is thorough, utilizing a variety of datasets and benchmarks, including the BEANS benchmark for bioacoustic tasks. The results demonstrate that BioME outperforms larger models, including its teacher model, in several scenarios, particularly in resource-constrained setups. The ablation studies provide clear insights into the contributions of different architectural components, validating the effectiveness of the proposed modifications and confirming the model's robustness across various tasks.
The authors have made efforts to ensure reproducibility by providing code and pretrained checkpoints publicly. However, the paper lacks specific URLs for accessing these resources, which could enhance reproducibility further. Detailed descriptions of the datasets and training procedures are included, which aids in replicating the experiments.
One limitation is the potential overfitting observed in larger model configurations, particularly in specific tasks like binary classification for beehive monitoring. Additionally, while the model shows promise, the paper does not extensively discuss the trade-offs between model size and performance in all contexts, which could be important for practical applications.
The implications of this work are significant for ecological monitoring and conservation efforts, as it enables efficient and effective bioacoustic monitoring using resource-constrained IoT devices. The advancements in self-supervised learning for audio representation can also influence broader applications in machine learning, particularly in fields requiring real-time audio processing and analysis.
Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
Primary: Adobe Research
All Institutions: Adobe Research
The paper introduces Stemphonic, a novel framework for efficient multi-stem music generation, significantly advancing the field of audio generation through innovative methodologies and promising experimental results.
The methodology presents a novel framework that integrates diffusion and flow-based models for music stem generation, addressing the limitations of existing approaches by allowing for variable and synchronized stem outputs in a single inference pass. The introduction of techniques such as stem grouping and noise sharing during training is particularly innovative, as it enhances inter-stem cohesion and synchronization, which are critical in music generation tasks. The approach is well-structured and builds upon established generative models, showcasing a clear progression from theory to practical application.
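The shared-noise grouping described above is simple to express in code. The sketch below assumes one stem per batch element and an integer group id per song (details the review does not specify); it draws one noise tensor per group and broadcasts it to every stem in that group, and is illustrative rather than Stemphonic's actual training loop.

```python
import torch

def shared_noise_for_groups(latents: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
    """Sample one noise tensor per group of synchronized stems and
    broadcast it to every stem (batch element) in that group.

    latents:   (batch, channels, time) stem latents, one stem per batch element
    group_ids: (batch,) integer id of the song/group each stem belongs to
    """
    noise = torch.empty_like(latents)
    for g in group_ids.unique():
        mask = group_ids == g
        shared = torch.randn_like(latents[mask][:1])  # one noise draw for this group
        noise[mask] = shared.expand_as(latents[mask])
    return noise

# Usage: two songs, each contributing two synchronized stems.
latents = torch.randn(4, 8, 256)
group_ids = torch.tensor([0, 0, 1, 1])
noise = shared_noise_for_groups(latents, group_ids)
assert torch.equal(noise[0], noise[1]) and torch.equal(noise[2], noise[3])
```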
The experiments are comprehensive, utilizing multiple datasets and evaluation metrics to assess the quality of generated stems and mixes. The results demonstrate significant improvements in generation quality and efficiency, with quantitative metrics such as Fréchet Audio Distance (FAD) providing a robust framework for evaluation. The ablation studies effectively highlight the contributions of the proposed techniques, reinforcing the validity of the claims made by the authors.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on specific datasets for training and evaluation, which may not fully capture the diversity of music styles and genres. Additionally, while the model shows promise in generating synchronized stems, the quality of generated audio may still vary depending on the complexity of the input prompts and conditions.
The proposed framework has significant implications for music production, enabling greater creative control for musicians and content creators. By facilitating the generation of isolated instrument tracks, it can streamline workflows in music composition and production, potentially democratizing music creation for non-experts. The ability to generate stems on-demand could also enhance collaborative efforts in music-making.
In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially stronger performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its robustness in practical use. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Covo-Audio, a novel end-to-end LALM that effectively integrates audio processing and semantic reasoning, demonstrating strong performance across various tasks. This work represents a significant advancement in the field of audio machine learning, particularly in its approach to conversational systems and dialogue intelligence.
The methodology presented in Covo-Audio is innovative as it integrates a large-scale end-to-end LALM capable of processing continuous audio inputs and generating audio outputs. The architecture is designed for various tasks, including speech-text modeling and full-duplex voice interaction, which demonstrates a comprehensive approach to audio processing. The intelligence-speaker decoupling strategy is particularly noteworthy as it allows for flexible voice customization while maintaining dialogue performance, showcasing a novel approach to reducing deployment costs.
The experiments are extensive, covering multiple benchmarks and demonstrating strong performance against representative open-source models. The paper provides quantitative results that validate the model's capabilities in speech-text comprehension and conversational abilities. However, the paper could benefit from more detailed comparisons with existing models to better contextualize its performance.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While it mentions large-scale pretraining and post-training, the absence of code or a project URL limits the ability for other researchers to replicate the findings or build upon the work.
One limitation is the high parameter count of the model, which may hinder accessibility for researchers with limited computational resources. Additionally, while the decoupling strategy is innovative, its practical implications and potential trade-offs in performance are not thoroughly explored.
The potential applications of Covo-Audio are significant, particularly in developing more capable conversational assistants that can handle complex audio interactions. The model's ability to generate empathetic responses could enhance user experience in real-world applications, making it a valuable contribution to the field of audio processing and conversational AI.
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
Primary: unknown
All Institutions: unknown
The paper presents TVTSyn, a novel streaming voice conversion and anonymization system that effectively synchronizes speaker identity with content through a time-varying timbre representation, demonstrating significant advancements in privacy and expressivity under strict latency constraints. The methodology is innovative, and the experimental results suggest a strong potential for real-world applications, although further work is needed to address limitations and enhance reproducibility.
The proposed methodology introduces a novel time-varying timbre (TVT) representation that synchronizes speaker identity with content, addressing the static-dynamic mismatch prevalent in existing voice conversion systems. The architecture is well-structured, comprising a Global Timbre Memory (GTM) that enhances the expressivity of speaker identity while maintaining low latency, which is crucial for real-time applications. The use of a factorized vector-quantized bottleneck to regularize content and reduce speaker leakage is a significant innovation that contributes to the overall effectiveness of the system. The integration of causal convolutional networks and self-attention mechanisms demonstrates a sophisticated approach to maintaining temporal coherence in streaming scenarios.
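As a rough illustration of how frame-level content can attend to a small timbre memory and blend the result with a global timbre via spherical interpolation, consider the sketch below. The facet count, the gating scheme, and the use of freely learned facets (rather than facets expanded from the enrolled timbre instance) are assumptions for illustration; the paper's actual Global Timbre Memory is not specified in this summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation between (normalized) vectors a and b with weight t in [0, 1]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    omega = torch.acos((a * b).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    return torch.sin((1 - t) * omega) / so * a + torch.sin(t * omega) / so * b

class TimbreMemory(nn.Module):
    def __init__(self, dim: int, num_facets: int = 8):
        super().__init__()
        self.facets = nn.Parameter(torch.randn(num_facets, dim))  # compact timbre facets
        self.gate = nn.Linear(dim, 1)                             # how much local variation to allow

    def forward(self, content: torch.Tensor, global_timbre: torch.Tensor) -> torch.Tensor:
        # content: (batch, frames, dim), global_timbre: (batch, dim)
        attn = torch.softmax(content @ self.facets.t() / content.shape[-1] ** 0.5, dim=-1)
        local = attn @ self.facets                                 # (batch, frames, dim)
        t = torch.sigmoid(self.gate(content))                      # (batch, frames, 1)
        g = global_timbre.unsqueeze(1).expand_as(local)
        return slerp(g, local, t)                                  # time-varying timbre per frame
```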
The experiments are comprehensive, evaluating the proposed system against state-of-the-art (SOTA) methods across multiple metrics, including naturalness, speaker transfer, and anonymization effectiveness. The use of perceptual listening tests alongside objective metrics provides a well-rounded assessment of performance. The results indicate that TVTSyn achieves a favorable balance between privacy and utility, outperforming several baselines in terms of both speaker similarity and anonymization quality. However, the paper could benefit from a more detailed exploration of the datasets used and the specific configurations of the baseline models for clearer comparisons.
The paper provides a detailed account of the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ability for others to replicate the results fully. The authors mention that the model was trained on specific datasets, but more information on data preprocessing and augmentation techniques would enhance reproducibility.
One notable limitation is the reliance on a fixed number of pseudo-speakers, which may restrict the model's adaptability to diverse speaker characteristics in real-world applications. Additionally, while the model performs well under controlled conditions, its robustness in noisy or variable environments has not been thoroughly evaluated. Future work should also address the scalability of the system in terms of processing power and memory requirements, especially for deployment in resource-constrained settings.
The implications of this research are significant, particularly in the context of privacy-preserving technologies for voice communication. The ability to anonymize speaker identity while maintaining intelligibility and naturalness is crucial for applications in teleconferencing, live translation, and other real-time voice interfaces. As privacy concerns continue to grow, the development of effective voice conversion and anonymization systems like TVTSyn could play a vital role in enhancing user security and trust in voice technologies.
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
Primary: University of Maryland
All Institutions: University of Illinois Urbana-Champaign, University of Maryland, Adobe
The paper presents Gencho, a novel diffusion-transformer model for generating room impulse responses from reverberant speech, significantly advancing the state of the art in acoustic matching and generative audio applications. The comprehensive methodology and experimental validation underscore its potential impact on the field of audio processing.
The methodology presented in this paper is innovative, leveraging a diffusion-transformer architecture to generate room impulse responses (RIRs) from reverberant speech. The proposed structure-aware encoder effectively separates early and late reflections, which is a notable improvement over traditional methods that treat the input as a monolithic signal. This separation allows for more accurate modeling of the acoustic environment. The use of a diffusion-based decoder enhances the model's ability to generate diverse and perceptually realistic outputs, addressing the limitations of previous non-generative approaches. The integration of text conditioning for RIR generation further demonstrates the versatility of the proposed method.
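The early/late isolation the encoder exploits follows a standard acoustic convention: an impulse response is commonly split at a fixed interval (often around 50 ms) after the direct-path peak. The sketch below illustrates only that convention; it is not Gencho's encoder, whose learned representation and exact split are not detailed here.

```python
import numpy as np

def split_rir(rir: np.ndarray, sr: int, boundary_ms: float = 50.0):
    """Split a room impulse response into early reflections and late reverberation
    at a fixed boundary after the direct-path peak (a common convention)."""
    onset = int(np.argmax(np.abs(rir)))           # direct-path arrival
    cut = onset + int(sr * boundary_ms / 1000.0)  # early/late boundary
    early, late = rir[:cut], rir[cut:]
    # Energy ratio between the two segments (dB), often used to characterize rooms.
    ratio_db = 10.0 * np.log10((np.sum(early ** 2) + 1e-12) / (np.sum(late ** 2) + 1e-12))
    return early, late, ratio_db
```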
The experiments are well-structured, utilizing a variety of datasets to evaluate the model's performance in different scenarios. The comparison with baseline models, particularly the regression-based FiNS variants, effectively highlights the advantages of the proposed Gencho model. The results indicate significant improvements in standard RIR metrics, showcasing the model's ability to generalize across unseen data. The hybrid approach combining the strengths of both generative and non-generative methods is a valuable addition to the experimental evaluation, demonstrating practical applications in real-world settings.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The authors could enhance reproducibility by providing access to their implementation and datasets used for training and evaluation.
One limitation of the proposed method is its reliance on high-quality input data for optimal performance. The model may struggle with noisy or poorly recorded reverberant speech, which could affect the accuracy of the generated RIRs. Additionally, while the text-to-RIR generation shows promise, the model's performance may vary based on the quality and specificity of the text prompts provided.
The implications of this research are significant for various applications in audio processing, including automated dialogue replacement, immersive audio experiences in AR/VR, and generative audio content creation. By enabling more flexible and realistic acoustic simulations, this work could enhance the quality of synthetic speech and audio in numerous contexts, ultimately contributing to advancements in the field of machine learning and audio technology.
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging: conventional networks typically operate in a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing the Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. Its superiority also holds in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, consistent with the intrinsic circular nature of phase. The source code is available at https://github.com/wangchengzhong/RENet.
Primary: Institute of Acoustics, Chinese Academy of Sciences
All Institutions: Institute of Acoustics, Chinese Academy of Sciences, University of Chinese Academy of Sciences
The paper presents a significant advancement in speech enhancement through a novel phase modeling approach that respects the geometric properties of phase data. The methodology is innovative, and the results demonstrate substantial improvements over existing methods, marking a meaningful contribution to the field of audio processing and machine learning.
The paper introduces a novel manifold-aware framework for phase modeling in speech enhancement, emphasizing Global Rotation Equivariance (GRE) to address the circular topology of phase data. The methodology is well-structured, with two main components: the Magnitude-Phase Interactive Convolutional Module (MPICM) and the Hybrid-Attention Dual-FFN (HADF). These components facilitate effective interaction between magnitude and phase streams while preserving the intrinsic geometric properties of phase. The approach is innovative, as it fundamentally alters how phase information is processed in deep learning architectures, moving away from traditional Euclidean assumptions.
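The GRE property itself is easy to state and check: if phase is represented as unit-magnitude complex values and a layer is complex-linear, a global phase rotation of the input produces the same rotation of the output. The toy check below illustrates only this property; the paper's MPICM and HADF designs are more involved and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "phase stream": unit-magnitude complex values for F frequency bins and T frames.
F_bins, T = 16, 10
phase = np.exp(1j * rng.uniform(-np.pi, np.pi, size=(F_bins, T)))

# A complex-linear map along the frequency axis is globally rotation-equivariant:
# rotating every input phase by theta rotates the output by the same theta.
W = rng.standard_normal((F_bins, F_bins)) + 1j * rng.standard_normal((F_bins, F_bins))
theta = 0.7
lhs = W @ (np.exp(1j * theta) * phase)   # rotate input, then apply the layer
rhs = np.exp(1j * theta) * (W @ phase)   # apply the layer, then rotate the output
assert np.allclose(lhs, rhs)
```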
The authors conduct extensive experiments across various tasks, including phase retrieval, denoising, dereverberation, and bandwidth extension. They use established datasets like VoiceBank+DEMAND and DNS Challenge 2020, demonstrating the effectiveness of their method against multiple strong baselines. The results indicate significant improvements in phase modeling accuracy and perceptual quality metrics, showcasing the robustness of the proposed architecture in diverse acoustic conditions. However, the paper could benefit from more detailed comparisons with a wider range of state-of-the-art methods.
The paper provides a clear description of the proposed architecture and the experimental setup, including datasets and training configurations. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work. However, specific hyperparameter settings and training details could be elaborated further to facilitate easier replication of results.
While the proposed method shows promising results, the paper does not address potential limitations such as the computational complexity of the model and its scalability to larger datasets. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other speech enhancement scenarios.
The proposed framework has significant implications for various applications in telecommunications, smart devices, and hearing aids, where effective speech enhancement is crucial. By improving phase modeling, the method could lead to advancements in real-time speech processing systems, enhancing user experience in noisy environments.
Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the input using a single sequence modeling module, making it both input-adaptive and parameter-efficient. We investigate two variants of SFC, one based on cross-attention and the other on Mamba, and introduce inductive biases inspired by the BS module to make them suitable for frequency information compression. Experiments on MSS and CASS tasks demonstrate that the SFC module consistently outperforms the BS module across different separator sizes and compression ratios. We also provide an analysis showing that SFC adaptively captures frequency patterns from the input.
Primary: National Institute of Advanced Industrial Science and Technology (AIST)
All Institutions: National Institute of Advanced Industrial Science and Technology (AIST), Waseda University
The main contribution of this paper is the introduction of the Spectral Feature Compression module, which provides a novel, input-adaptive, and parameter-efficient approach to spectral feature compression for source separation tasks. This work represents a meaningful advancement in the field of audio processing, addressing key limitations of existing methods and demonstrating strong empirical results.
The paper introduces a novel approach to spectral feature compression through the Spectral Feature Compression (SFC) module, which utilizes sequence modeling techniques to create an input-adaptive and parameter-efficient method for source separation. The methodology is well-structured, addressing the limitations of the traditional band-split (BS) module by incorporating inductive biases and demonstrating the effectiveness of two variants based on cross-attention and Mamba. The approach is innovative in its attempt to adaptively capture frequency patterns, which is a significant advancement over previous methods.
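A minimal sketch of the cross-attention variant's core idea, under the assumption that a small set of learned query tokens attends over per-bin frequency features to produce a compressed, input-adaptive representation (the paper's additional inductive biases and the Mamba variant are omitted):

```python
import torch
import torch.nn as nn

class CrossAttentionCompressor(nn.Module):
    """Compress F frequency-bin features into K tokens with learned queries."""
    def __init__(self, dim: int, num_tokens: int, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, freq_feats: torch.Tensor) -> torch.Tensor:
        # freq_feats: (batch, F, dim) features along the frequency axis
        q = self.queries.unsqueeze(0).expand(freq_feats.shape[0], -1, -1)
        compressed, _ = self.attn(q, freq_feats, freq_feats)  # (batch, K, dim)
        return compressed

# Example: compress 256 frequency bins into 32 tokens with a single shared module,
# in contrast to a band-split encoder with one dedicated module per subband.
sfc = CrossAttentionCompressor(dim=64, num_tokens=32)
x = torch.randn(8, 256, 64)
z = sfc(x)  # (8, 32, 64)
```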
The experiments are comprehensive, evaluating the proposed SFC module against the BS module across various tasks, including music source separation (MSS) and cinematic audio source separation (CASS). The results consistently show that SFC outperforms BS across different separator sizes and compression ratios, indicating a robust experimental design. However, details on the datasets used and the specific metrics for evaluation could be elaborated further to enhance clarity.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the experimental setup. While the methodology is sound, the absence of a project URL or demo could hinder other researchers from replicating the results.
One limitation is the reliance on inductive biases inspired by the BS module, which may not generalize well to all types of audio signals. Additionally, while the SFC module shows promise, its performance in real-world scenarios beyond the tested datasets remains unverified.
The proposed method has significant implications for audio processing applications, particularly in enhancing the quality of source separation in music and cinematic audio. The input-adaptive nature of the SFC module could lead to more efficient and effective audio processing systems, potentially influencing both academic research and industry practices.
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
Primary: Tsinghua University
All Institutions: Tsinghua University, ByteDance
NarraScore represents a significant advancement in the synthesis of soundtracks for long-form videos by establishing a novel framework that connects visual narratives with musical dynamics through emotional control. The approach is innovative and addresses critical challenges in the field, although further validation and reproducibility efforts are needed to solidify its impact.
The methodology presented in NarraScore is innovative, leveraging frozen Vision-Language Models (VLMs) as affective sensors to convert visual narratives into Valence-Arousal trajectories. The Dual-Branch Injection strategy is particularly noteworthy, as it effectively balances global coherence with local dynamism, addressing the common pitfalls of dense attention mechanisms. The minimalist design is a strong point, as it aims to reduce overfitting risks associated with data scarcity, which is a prevalent issue in machine learning applications involving audio synthesis. However, the paper could benefit from a more detailed explanation of the training process and the specific architectures employed.
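To illustrate element-wise residual injection of an affective trajectory, the sketch below projects a per-step valence-arousal pair into the token dimension and adds it with a learned, zero-initialized scale. The actual Token-Level Affective Adapter is not specified in this summary, so treat this as a hypothetical stand-in rather than the paper's design.

```python
import torch
import torch.nn as nn

class AffectiveAdapter(nn.Module):
    """Inject a valence-arousal trajectory into token features as an
    element-wise residual (a lightweight alternative to extra cross-attention)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.scale = nn.Parameter(torch.zeros(1))  # start as identity, learn how much to inject

    def forward(self, tokens: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T, dim) generator tokens; va: (batch, T, 2) valence-arousal per step
        return tokens + self.scale * self.proj(va)
```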
The experiments demonstrate that NarraScore achieves state-of-the-art performance in terms of coherence and narrative alignment. The authors provide sufficient empirical evidence to support their claims, although the paper lacks a comprehensive comparison with existing methods beyond a cursory mention. The results are promising, but the absence of detailed metrics or benchmarks makes it difficult to fully gauge the significance of the improvements claimed. Future work should include a broader set of comparisons to further validate the approach.
The paper does not provide sufficient details regarding the implementation, datasets, or training procedures, which raises concerns about reproducibility. Clearer guidelines and access to code or data would significantly enhance the ability of other researchers to replicate the findings. The lack of a demo or project URL further complicates this aspect.
The authors acknowledge limitations related to the temporal granularity of affective control, which could hinder synchronization with rapid visual events. Additionally, the cascaded design may lead to error propagation, which could affect the overall performance. Addressing these limitations in future work will be crucial for improving the robustness of the framework.
The potential applications of NarraScore are significant, particularly in the fields of film, gaming, and content creation where automated soundtrack generation could enhance user experience. The ability to generate music that aligns with narrative emotion could also open new avenues for interactive media. However, ethical considerations regarding the use of AI-generated content and its implications for creative industries should be discussed further.
Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored using audio-text joint embeddings, allowing users to enroll phrases with text, and proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often overly bias the beginning phonemes of an enrollment, causing false triggers when negative enrollment-query pairs share a prefix ("turn the volume up" vs. "turn the volume down"). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB) with two datasets, POB-Spark and POB-LibriPhrase (POB-LP), containing mismatched audio-text pairs with shared prefixes, and propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. Using EPS alone reduces EER on POB-Spark from 64.4% to 29.3% and improves POB-LP accuracy from 87.6% to 96.8%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added in training, our work achieves the best POB benchmark results while incurring the least degradation on prior metrics among baselines. This degradation is most pronounced in GSC, which contains only one-word commands. We identify mitigating this trade-off as future work.
Primary: University of California San Diego
All Institutions: University of California San Diego, Bose Corporation
This paper makes a significant contribution to the field of machine learning by addressing a critical issue in open-vocabulary keyword spotting and providing innovative solutions that enhance the robustness of voice recognition systems. The combination of novel datasets and a lightweight scoring mechanism positions this work as a valuable resource for future research and practical applications in audio processing.
The paper introduces a novel approach to address prefix bias in open-vocabulary keyword spotting (OV-KWS) through the creation of the Partial Overlap Benchmark (POB) and the Equal-weighting Position Scoring (EPS) module. The methodology is well-structured, with a clear definition of the problem and innovative solutions that include both dataset creation and a lightweight scoring mechanism. The EPS module is particularly noteworthy for its simplicity and effectiveness in mitigating prefix bias without adding complexity to the model architecture.
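The effect of equal position weighting can be illustrated with a toy aggregation over per-position audio-text match scores; the exact EPS formulation is not reproduced here, and the prefix-biased baseline below is only a hypothetical contrast.

```python
import numpy as np

def position_biased_score(position_scores: np.ndarray, decay: float = 0.5) -> float:
    """A prefix-biased aggregate: earlier positions dominate (illustrative only)."""
    w = decay ** np.arange(len(position_scores))
    return float(np.sum(w * position_scores) / np.sum(w))

def equal_weight_score(position_scores: np.ndarray) -> float:
    """Equal-weighting aggregate: every position contributes the same."""
    return float(np.mean(position_scores))

# "turn the volume up" enrolled vs. query "turn the volume down": per-position match
# scores agree on the shared prefix and disagree only at the end.
scores = np.array([0.95, 0.94, 0.93, 0.92, 0.10])
print(position_biased_score(scores))  # stays high -> likely false trigger
print(equal_weight_score(scores))     # pulled down by the mismatched suffix
```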
The experiments are comprehensive, utilizing both the newly created POB datasets and established benchmarks like LibriPhrase and Google Speech Commands (GSC). The results demonstrate significant improvements in performance metrics, particularly in the context of partial overlap scenarios, which are often overlooked in existing datasets. The paper effectively presents controlled experiments that isolate the impact of the EPS module and the POB data, providing a clear narrative of the contributions to the field.
The paper provides sufficient implementation details, including model architectures, training procedures, and dataset descriptions, which enhance reproducibility. The authors have also made their datasets publicly available, which is a positive step towards ensuring that other researchers can validate and build upon their work.
One limitation noted is the trade-off in performance when adding POB data during training, particularly affecting short commands in datasets like GSC. This suggests a potential area for further research to optimize data composition and balance performance across different command lengths. Additionally, the paper does not explore the implications of the EPS module in more complex or varied acoustic environments, which could limit its applicability.
The advancements in OV-KWS have significant implications for personalized device control and user interaction with technology. By addressing prefix bias, the proposed methods could lead to more robust and user-friendly voice recognition systems, enhancing accessibility and usability in various applications, from smart home devices to gaming.
Sound source tracking is commonly performed using classical array-processing algorithms, while machine-learning approaches typically rely on precise source position labels that are expensive or impractical to obtain. This paper introduces a physics-guided variational model capable of fully unsupervised single-source sound source tracking. The method combines a variational encoder with a physics-based decoder that injects geometric constraints into the latent space through analytically derived pairwise time-delay likelihoods. Without requiring ground-truth labels, the model learns to estimate source directions directly from microphone array signals. Experiments on real-world data demonstrate that the proposed approach outperforms traditional baselines and achieves accuracy and computational complexity comparable to state-of-the-art supervised models. We further show that the method generalizes well to mismatched array geometries and exhibits strong robustness to corrupted microphone position metadata. Finally, we outline a natural extension of the approach to multi-source tracking and present the theoretical modifications required to support it.
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, NXP Semiconductors
The paper presents a novel physics-guided variational model for unsupervised sound source tracking, effectively bridging machine learning and traditional signal processing techniques. The methodology is innovative, and the results demonstrate significant potential for real-world applications, marking a meaningful contribution to the field of audio processing.
The proposed methodology integrates a variational autoencoder with a physics-based decoder, effectively combining machine learning with established physical principles to enhance sound source tracking. This innovative approach allows for unsupervised learning by leveraging geometric constraints without requiring labeled data, which is a significant advancement over traditional methods. The use of a von Mises-Fisher distribution for directional statistics is particularly noteworthy, as it is well-suited for the problem at hand. The architecture is designed for efficiency, allowing for parallel processing, which is crucial for real-time applications.
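The geometric constraint at the heart of the physics-based decoder rests on the far-field relation between source direction and pairwise time delays, tau_ij = ((r_j - r_i) . u) / c. The sketch below computes the expected pairwise delays for a candidate direction and a given array geometry; how these delays enter the decoder's likelihood is not reproduced here, and the 2-D far-field setup is an assumption for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def expected_pairwise_delays(mic_positions: np.ndarray, azimuth: float) -> np.ndarray:
    """Far-field expected time differences of arrival (seconds) for every microphone pair,
    given a 2-D source direction (azimuth in radians)."""
    u = np.array([np.cos(azimuth), np.sin(azimuth)])   # unit vector toward the source
    # Relative delay of each mic: positive if the wavefront reaches it later than the origin.
    per_mic = -(mic_positions @ u) / SPEED_OF_SOUND
    return per_mic[:, None] - per_mic[None, :]         # tau_ij matrix

# Square 4-mic array with 5 cm spacing, source at 30 degrees.
mics = 0.05 * np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
tdoa = expected_pairwise_delays(mics, np.deg2rad(30.0))
print(tdoa * 1e6)  # microseconds; antisymmetric matrix with zeros on the diagonal
```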
The experiments conducted are robust, comparing the proposed method against classical and state-of-the-art supervised models across various scenarios, including noise and uncertainty in microphone positioning. The results demonstrate that the proposed model not only outperforms classical methods but also competes well with supervised approaches, showcasing its effectiveness in real-world applications. The use of real-world data for testing adds credibility to the findings.
The paper provides a detailed description of the methodology, including the architecture and loss functions, which facilitates reproducibility. However, the lack of a publicly available code repository or demo limits the ease with which other researchers can replicate the results.
One limitation is the focus on single-source tracking, which may restrict the applicability of the method in more complex environments with multiple sound sources. Additionally, while the model performs well under various conditions, its performance relative to supervised models in all scenarios may not be consistent, particularly in highly reverberant environments.
This research has the potential to significantly impact fields such as robotics, hearing aids, and surveillance systems, where accurate sound source localization is critical. The unsupervised nature of the model could lead to more accessible implementations in devices that cannot afford extensive labeled training data.
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University
The paper presents ProtoDisent-TTS, a prototype-based disentanglement TTS framework that effectively synthesizes dysarthric speech while preserving speaker identity. The innovative methodology and promising experimental results position this work as a valuable contribution to the field of speech synthesis and assistive technologies.
The proposed ProtoDisent-TTS framework introduces a novel approach to disentangling speaker identity from dysarthric articulation by utilizing a prototype-based codebook and a dual-classifier objective. This method is innovative as it combines elements of text-to-speech synthesis with a clear focus on pathology, allowing for controlled speech generation. The use of a gradient reversal layer to enforce invariance of speaker embeddings to dysarthric attributes is particularly noteworthy, as it addresses a significant challenge in the field of speech synthesis.
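The gradient reversal layer is a standard construction (identity in the forward pass, sign-flipped gradient in the backward pass); a minimal PyTorch version is sketched below. How it is wired into ProtoDisent-TTS's dual-classifier objective is an assumption made only for the usage comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; gradients are multiplied by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Hypothetical usage: a pathology classifier trained on grad_reverse(speaker_embedding)
# pushes the speaker embedding to carry no pathology information (adversarial invariance).
speaker_emb = torch.randn(4, 256, requires_grad=True)
pathology_head = nn.Linear(256, 2)
logits = pathology_head(grad_reverse(speaker_emb))
loss = F.cross_entropy(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()  # gradients reaching speaker_emb are sign-flipped
```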
The experiments conducted on the TORGO dataset are well-structured and demonstrate the effectiveness of the proposed framework. The results show consistent improvements in ASR performance and speaker identity preservation, validating the utility of synthetic data generated by ProtoDisent-TTS. However, the paper could benefit from more extensive comparisons with existing state-of-the-art methods to better contextualize the results.
The implementation details provided are thorough, including specifics on the architecture, training procedures, and hyperparameters. However, the absence of a publicly accessible code repository limits the reproducibility of the results. The authors mention using a pre-trained Index-TTS model, but it would be beneficial to provide access to this model or detailed instructions for replication.
One limitation of the study is the reliance on a relatively small dataset (TORGO), which may affect the generalizability of the findings. Additionally, while the framework shows promise for dysarthric speech synthesis, its performance in real-world applications and with diverse speaker populations remains to be evaluated.
The work has significant implications for assistive speech technologies, particularly for individuals with dysarthria. By enabling controllable and interpretable speech synthesis, the framework could enhance communication for those affected by speech disorders. This research could also inspire further studies in related areas, such as voice conversion and personalized speech synthesis.
While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learning via Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti excels in precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.
Primary: Wuhan University of Technology
All Institutions: Wuhan University of Technology, Tencent Inc.
The paper presents Tutti, a novel framework for dynamic multi-singer synthesis that significantly enhances the acoustic realism and artistic cohesion of choral generation. The innovative methodology and comprehensive experimental validation position this work as a meaningful contribution to the field of machine learning and audio synthesis.
The methodology presented in this paper is robust and innovative, introducing the Tutti framework for multi-singer synthesis. The Structure-Aware Singer Prompt and the Complementary Texture Learning via Condition-Guided VAE are significant contributions that address the limitations of existing Singing Voice Synthesis (SVS) systems. The integration of these components allows for dynamic scheduling of singers and captures complex vocal textures, which are crucial for realistic multi-singer arrangements. The use of a Latent Diffusion Transformer (DiT) backbone enhances the model's ability to manage long musical sequences effectively.
The experimental setup is comprehensive, utilizing a large dataset for training and rigorous evaluation metrics, including both objective and subjective assessments. The results demonstrate significant improvements in multi-singer scheduling and acoustic realism compared to existing models. The ablation studies effectively highlight the contributions of each component of the proposed framework, reinforcing the importance of the adaptive fuser and texture learning in achieving high-quality synthesis.
The paper provides detailed implementation and training configurations, including model architecture, training parameters, and evaluation protocols. This level of detail supports reproducibility, allowing other researchers to replicate the experiments. However, the lack of a publicly available code repository limits accessibility for broader validation and experimentation.
The paper acknowledges limitations, such as the assumption that verse sections contain only a single singer, which may not reflect real-world scenarios. Additionally, the model's performance in melodicity and emotional expressiveness is noted as an area for improvement. These limitations suggest that while the framework is innovative, it may require further refinement to handle more complex musical arrangements.
The Tutti framework has the potential to significantly impact the field of music generation and synthesis, particularly in applications involving choral music and multi-singer arrangements. By enhancing the realism and expressiveness of synthesized singing voices, this research could facilitate advancements in music production, virtual performances, and interactive music applications. The implications extend to creative industries, education, and entertainment, where realistic vocal synthesis can enhance user experiences.
Current audio formats present a fundamental trade-off between file size and functionality: lossless formats like FLAC preserve quality but lack adaptability, while lossy formats reduce size at the cost of fidelity and offer no stem-level access. We introduce the Stem-Native Codec (SNC), a novel audio container format that stores music as independently encoded stems plus a low-energy mastering residual. By exploiting the lower information entropy of separated stems compared to mixed audio, SNC achieves a 38.2% file size reduction versus FLAC (7.76 MB vs. 12.55 MB for a 2:18 test track) while maintaining perceptual transparency (STOI = 0.996). Unlike existing formats, SNC enables context-aware adaptive playback, spatial audio rendering, and user-controlled remixing without requiring additional storage. Our experimental validation demonstrates that the stems-plus-residual architecture successfully decouples the conflicting requirements of compression efficiency and feature richness, offering a practical path toward next-generation audio distribution systems.
Primary: Wubble AI
All Institutions: Wubble AI
The main contribution of this paper is the introduction of the Stem-Native Codec (SNC), which combines compact, perceptually transparent audio storage with adaptive playback capabilities. This work presents a significant advancement in audio compression technology, addressing key limitations of existing formats and paving the way for future developments in audio distribution systems.
The methodology is well-structured, introducing the Stem-Native Codec (SNC) as a novel approach to audio storage that separates audio into independently encoded stems and a mastering residual. The theoretical framework is grounded in information theory, establishing a strong basis for the claim that separated stems have lower information entropy than mixed audio. The choice of using Opus for encoding stems is justified, and the detailed description of the encoding and decoding processes demonstrates a comprehensive understanding of audio compression techniques. However, the paper could benefit from clearer references to the sections mentioned in the contributions, as they are currently marked as [REF].
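The review describes the stems-plus-residual architecture only at a high level. As a rough sketch of the idea (not SNC's actual implementation), the encoder compresses each stem independently and stores, as a low-energy residual, whatever the lossy stem sum fails to reproduce of the mastered mix; `codec_encode` and `codec_decode` below are placeholders standing in for the Opus calls.

```python
import numpy as np

def snc_encode(master_mix: np.ndarray, stems: list, codec_encode, codec_decode):
    """Sketch: independently encode stems, then keep a mastering residual."""
    encoded_stems = [codec_encode(stem) for stem in stems]
    lossy_sum = np.sum([codec_decode(e) for e in encoded_stems], axis=0)
    residual = master_mix - lossy_sum   # low energy if the stem sum ≈ the mix
    return encoded_stems, residual

def snc_decode(encoded_stems, residual, codec_decode):
    """Sketch: reconstruct the full mix while keeping stems individually
    addressable for remixing, adaptive playback, or spatial rendering."""
    decoded_stems = [codec_decode(e) for e in encoded_stems]
    full_mix = np.sum(decoded_stems, axis=0) + residual
    return full_mix, decoded_stems
```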
The experimental validation is robust, showcasing a significant file size reduction of 38.2% compared to FLAC while maintaining high perceptual quality (STOI = 0.996). The use of objective metrics such as spectral convergence and SNR adds credibility to the results. The paper effectively compares SNC with existing formats and highlights its advantages in terms of adaptive playback and spatial audio rendering. However, the experiments rely on a single test track, which may limit the generalizability of the findings.
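The quoted size reduction is internally consistent ((12.55 − 7.76) / 12.55 ≈ 0.382). The objective metrics mentioned here have standard definitions, sketched below for reference; the paper's exact STFT settings and implementations may differ.

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-noise ratio of a reconstruction, in dB (higher is better)."""
    noise = reference - estimate
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2)))

def spectral_convergence(ref_mag: np.ndarray, est_mag: np.ndarray) -> float:
    """Spectral convergence: Frobenius-norm error of the estimated magnitude
    spectrogram relative to the reference (lower is better)."""
    return float(np.linalg.norm(ref_mag - est_mag) / np.linalg.norm(ref_mag))
```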
The paper provides open-source encoder and decoder implementations, which is a strong point for reproducibility. The detailed encoding parameters and procedures are well-documented, allowing for potential replication of the results. However, the lack of a demo or project URL limits accessibility for interested researchers.
The primary limitation identified is the dependency on high-quality stems for effective encoding. The paper acknowledges that AI separation methods may introduce artifacts, which could affect the performance of SNC. Additionally, the decoding complexity is slightly higher than traditional formats, which may pose challenges for some applications. The need for standardized metadata schemas for adaptive playback features is also a potential barrier to widespread adoption.
The SNC has the potential to significantly influence music distribution by enabling smaller file sizes and enhanced playback experiences tailored to diverse environments. It opens up new avenues for artists to engage with their audience through remixing capabilities and adaptive features. The proposed format could also lead to reduced storage and bandwidth costs for streaming platforms, making advanced audio formats more accessible.
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
Primary: Soul-AI Lab
All Institutions: Soul-AI Lab
SoulX-Singer represents a significant advancement in zero-shot singing voice synthesis, combining a large-scale dataset with innovative modeling techniques to achieve high-quality, flexible vocal generation across multiple languages. The comprehensive evaluation and robust methodology position this work as a valuable contribution to the field of machine learning and audio synthesis.
The methodology of SoulX-Singer is robust, leveraging a large-scale dataset of over 42,000 hours of vocal recordings to enhance zero-shot generalization capabilities. The dual-control mechanism (melody-control and score-control modes) is innovative, allowing for flexible synthesis based on different input types. The data processing pipeline is well-structured, ensuring high-quality vocal extraction and annotation, which is crucial for training effective models. The use of flow matching and a dedicated Singing Content Encoder to manage multimodal inputs is a significant advancement in the field.
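The review names flow matching without detailing it; for readers unfamiliar with the technique, the generic (rectified) flow-matching training objective looks like the following, where the conditioning would carry the score/melody and lyric information. The model interface and conditioning tensor are placeholders, not SoulX-Singer's actual code.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Generic flow-matching objective: regress the velocity that transports
    a Gaussian sample x0 toward the target acoustic features x1 along a
    straight path, conditioned on musical/linguistic inputs."""
    x0 = torch.randn_like(x1)                        # source (noise) sample
    t = torch.rand(x1.shape[0], device=x1.device)    # per-example time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1.0 - t_) * x0 + t_ * x1                   # point on the straight path
    target_velocity = x1 - x0
    pred_velocity = model(xt, t, cond)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```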
The experimental evaluation is thorough, utilizing two distinct benchmarks (GMO-SVS and SoulX-Singer-Eval) to assess performance across multiple dimensions, including melodic accuracy, intelligibility, and overall singing quality. The results consistently demonstrate that SoulX-Singer outperforms existing state-of-the-art models, showcasing its effectiveness in both controlled and zero-shot scenarios. The comprehensive metrics used for evaluation provide a clear picture of the model's capabilities.
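"Melodic accuracy" is not defined in the review; one common proxy in singing voice evaluation is F0-based raw pitch accuracy on voiced frames, sketched below under the assumption that frame-aligned F0 contours (in Hz, 0 for unvoiced) are available. The benchmark may use a different formulation.

```python
import numpy as np

def raw_pitch_accuracy(f0_ref: np.ndarray, f0_est: np.ndarray,
                       tolerance_cents: float = 50.0) -> float:
    """Fraction of voiced reference frames whose estimated pitch falls within
    a tolerance (in cents) of the reference pitch."""
    voiced = f0_ref > 0
    est = np.maximum(f0_est[voiced], 1e-6)           # guard against log(0)
    cents_error = 1200.0 * np.abs(np.log2(est / f0_ref[voiced]))
    return float(np.mean(cents_error <= tolerance_cents))
```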
The paper provides sufficient detail regarding the architecture, training process, and evaluation metrics, which supports reproducibility. The availability of the dataset and code on GitHub further enhances the potential for other researchers to replicate the study. However, the reliance on specific pretrained models for vocal extraction and transcription may pose some challenges in reproducing the exact results without access to those models.
One limitation of the study is the potential for voice impersonation and ethical concerns associated with the use of synthesized voices, which the authors acknowledge. Additionally, while the model shows strong performance across multiple languages, the dataset's composition may still limit its generalization to other languages or dialects not represented in the training data.
SoulX-Singer has significant implications for the music production industry, enabling creators to synthesize high-quality singing voices without the need for extensive vocal recordings. This technology could democratize music creation, allowing individuals without access to professional singers to produce high-quality vocal tracks. However, the ethical considerations surrounding voice synthesis and potential misuse must be addressed to ensure responsible deployment.