The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.
Primary: Tsinghua University
All Institutions: Tsinghua University, ByteDance China, Department of Psychological and Cognitive Sciences, School of Information Science and Technology, ShanghaiTech University
The main contribution of this paper is the introduction of audio-interleaved reasoning, which significantly enhances the audio comprehension capabilities of LALMs by allowing them to engage with audio data dynamically during reasoning tasks. This innovative approach, combined with a robust training framework and comprehensive evaluation, positions the work as a significant advancement in the field of audio machine learning.
The paper introduces a novel approach called audio-interleaved reasoning, which allows Large Audio Language Models (LALMs) to actively engage with audio data during reasoning tasks. This is achieved through a two-stage training framework that combines supervised fine-tuning and reinforcement learning, enabling the model to dynamically re-listen to salient audio segments. The methodology is well-structured, leveraging human cognitive processes as inspiration, and includes a comprehensive data generation pipeline that produces high-quality training data. The approach is innovative in its treatment of audio as an active component rather than a static context, which is a significant departure from existing methods.
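To make the audio-interleaved reasoning loop concrete, the sketch below shows one way such a controller could be wired up, assuming the model signals a re-listen request with a tagged time span (e.g., `<listen>12.0,15.5</listen>`); the tag format and the `encode_audio`/`generate` helpers are illustrative assumptions, not Echo's actual interface.

```python
# Hedged sketch of an audio-interleaved reasoning loop; the tagged re-listen request,
# helper names, and stopping logic are assumptions for illustration only.
import re

LISTEN_TAG = re.compile(r"<listen>([\d.]+),([\d.]+)</listen>")

def audio_interleaved_reasoning(model, audio, question, sample_rate=16000, max_turns=4):
    context = [model.encode_audio(audio), question]  # one-time encoding as the starting context
    for _ in range(max_turns):
        partial = model.generate(context)            # reason in text until a re-listen request or an answer
        match = LISTEN_TAG.search(partial)
        if match is None:
            return partial                           # final answer, no further listening needed
        start, end = (float(g) for g in match.groups())
        segment = audio[int(start * sample_rate): int(end * sample_rate)]
        # Re-encode only the salient segment and interleave it with the reasoning trace.
        context += [partial, model.encode_audio(segment)]
    return model.generate(context)                   # fall back to answering after the turn budget
```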
The experiments are rigorously designed, utilizing multiple audio comprehension benchmarks to validate the effectiveness of the proposed methodology. The results demonstrate that Echo outperforms existing LALMs, including advanced proprietary models, in both expert-level and general-purpose tasks. The paper provides detailed comparisons and analyses, showcasing the advantages of the audio-interleaved reasoning format over traditional methods. The evaluation metrics are appropriate, and the results are statistically significant, reinforcing the claims made by the authors.
The paper includes a detailed description of the training framework, data generation pipeline, and evaluation settings, which supports reproducibility. The authors express a commitment to releasing the complete code and dataset in the future, which is crucial for enabling further research and validation of their findings.
While the proposed method shows promise, the authors acknowledge that the implementation remains relatively straightforward and that there is room for refinement. The current approach may not fully exploit the potential of audio re-listening, and the automated generation of CoT annotations lacks human heuristics, which could lead to biases in the training data. Additionally, the reliance on existing datasets may limit the generalizability of the findings.
The advancements in audio comprehension capabilities have significant implications for various applications, including human-computer interaction, accessibility technologies, and educational tools. By improving how machines understand and reason about audio, this research could lead to more intuitive and effective systems that better mimic human cognitive processes. The potential for future research in this area is substantial, particularly in enhancing the interaction between audio and other modalities.
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. This benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even the state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.
Primary: National University of Singapore
All Institutions: National University of Singapore, The Chinese University of Hong Kong, Tianjin University
The main contribution of this paper is the introduction of AudioRAG, a benchmark for evaluating audio reasoning in conjunction with information retrieval, alongside the development of an agentic pipeline that improves performance on this benchmark. This work significantly advances the understanding of Large Audio-Language Models' limitations and proposes a novel approach to enhance their reasoning capabilities through external knowledge integration.
The methodology is well-structured, introducing AudioRAG as a benchmark that combines audio reasoning with information retrieval. The authors employ both LLM-generated and manually curated questions, which is a thoughtful approach to ensure diversity and relevance in the dataset. The use of an agentic pipeline that integrates audio processing and retrieval-augmented generation is innovative and addresses the limitations of existing LALMs. However, the paper could benefit from more detailed descriptions of the audio processing tool and its integration with the reasoning LLM, as well as clearer explanations of the filtering process for question validity and answer correctness.
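As a rough illustration of what such an agentic pipeline can look like, the sketch below interleaves an audio-description tool with iterative web retrieval before answering; the `audio_tool`, `search`, and `llm` callables and the prompt wording are assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of an agentic audio-RAG loop under assumed components: `audio_tool`
# (captioning/transcription), `search` (web retrieval), and `llm` (text reasoner).
def audio_rag_answer(llm, audio_tool, search, audio, question, max_steps=3):
    evidence = [f"Audio description: {audio_tool(audio)}"]
    for _ in range(max_steps):
        prompt = "\n".join(evidence) + f"\nQuestion: {question}\n" \
                 "Reply with either SEARCH: <query> or ANSWER: <final answer>."
        step = llm(prompt)
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        query = step.removeprefix("SEARCH:").strip()
        evidence.append(f"Retrieved for '{query}': {search(query)}")
    return llm("\n".join(evidence) + f"\nQuestion: {question}\nAnswer directly.")
```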
The experimental evaluation is thorough, assessing multiple state-of-the-art LALMs against the AudioRAG benchmark. The results clearly demonstrate the challenges faced by current models, highlighting the need for improved reasoning capabilities. The comparison between raw models and the agentic pipeline provides compelling evidence of the pipeline's effectiveness. However, the paper lacks detailed statistical analyses and visualizations that could further substantiate the findings.
The paper provides a GitHub repository link for the dataset, which is a positive step towards reproducibility. However, it lacks detailed implementation instructions for the agentic pipeline and the specific configurations used in experiments. This could hinder other researchers from replicating the results accurately.
One limitation is the reliance on LLMs for generating questions and answers, which may introduce biases or inaccuracies inherent in the models. Additionally, the benchmark's scope may not cover all real-world scenarios, potentially limiting its applicability. The increase in invalid answers from the agentic pipeline suggests that the complexity of multi-hop reasoning may lead to logical errors.
The proposed benchmark and agentic pipeline have significant implications for enhancing audio-based reasoning systems. By addressing the challenges of integrating external knowledge with audio processing, this work could lead to more robust applications in various fields, including education, entertainment, and information retrieval systems.
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference-time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
Primary: Adobe Research
All Institutions: Adobe Research
The paper introduces Stemphonic, a novel framework for efficient multi-stem music generation, significantly advancing the field of audio generation through innovative methodologies and promising experimental results.
The methodology presents a novel framework that integrates diffusion and flow-based models for music stem generation, addressing the limitations of existing approaches by allowing for variable and synchronized stem outputs in a single inference pass. The introduction of techniques such as stem grouping and noise sharing during training is particularly innovative, as it enhances inter-stem cohesion and synchronization, which are critical in music generation tasks. The approach is well-structured and builds upon established generative models, showcasing a clear progression from theory to practical application.
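A minimal sketch of the shared-noise idea follows: synchronized stems are grouped, start from one shared initial latent, and are denoised in a single pass with stem-specific text conditioning. The Euler-style flow integration and the `encode_text`/`decode` helpers are simplifying assumptions, not the authors' exact sampler.

```python
# Illustrative sketch of shared-noise multi-stem sampling: stems from the same song share
# one initial noise latent so a single pass yields synchronized outputs. Shapes simplified.
import torch

def sample_synchronized_stems(flow_model, stem_prompts, latent_shape, num_steps=50):
    num_stems = len(stem_prompts)
    shared_noise = torch.randn(1, *latent_shape)                 # one latent per song/group
    x = shared_noise.repeat(num_stems, 1, 1)                     # every stem starts from the same noise
    text_emb = flow_model.encode_text(stem_prompts)              # stem-specific conditioning
    for i in range(num_steps):                                   # simple Euler integration of the flow
        t = torch.full((num_stems,), i / num_steps)
        velocity = flow_model(x, t, text_emb)
        x = x + velocity / num_steps
    return flow_model.decode(x)                                  # one synchronized waveform per stem
```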
The experiments are comprehensive, utilizing multiple datasets and evaluation metrics to assess the quality of generated stems and mixes. The results demonstrate significant improvements in generation quality and efficiency, with quantitative metrics such as Fréchet Audio Distance (FAD) providing a robust framework for evaluation. The ablation studies effectively highlight the contributions of the proposed techniques, reinforcing the validity of the claims made by the authors.
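For reference, Fréchet Audio Distance compares Gaussians fitted to embedding distributions of real and generated audio; the standard computation is sketched below (embedding extraction, e.g. with a VGGish-style model, is assumed to have happened upstream).

```python
# Standard FAD computation from two embedding sets, shown only to make the metric concrete.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real, emb_gen):
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```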
The paper provides detailed implementation specifics, including architecture choices, training procedures, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on specific datasets for training and evaluation, which may not fully capture the diversity of music styles and genres. Additionally, while the model shows promise in generating synchronized stems, the quality of generated audio may still vary depending on the complexity of the input prompts and conditions.
The proposed framework has significant implications for music production, enabling greater creative control for musicians and content creators. By facilitating the generation of isolated instrument tracks, it can streamline workflows in music composition and production, potentially democratizing music creation for non-experts. The ability to generate stems on-demand could also enhance collaborative efforts in music-making.
Deep Neural Networks (DNNs) often struggle to suppress noise at low signal-to-noise ratios (SNRs). This paper addresses speech enhancement in scenarios dominated by harmonic noise and proposes a framework that integrates cyclostationarity-aware preprocessing with lightweight DNN-based denoising. A cyclic minimum power distortionless response (cMPDR) spectral beamformer is used as a preprocessing block. It exploits the spectral correlations of cyclostationary noise to suppress harmonic components prior to learning-based enhancement and does not require modifications to the DNN architecture. The proposed pipeline is evaluated in a single-channel setting using two DNN architectures: a simple and lightweight convolutional recurrent neural network (CRNN), and a state-of-the-art model, namely ultra-low complexity network (ULCNet). Experiments on synthetic data and real-world recordings dominated by rotating machinery noise demonstrate consistent improvements over end-to-end DNN baselines, particularly at low SNRs. Remarkably, a parameter-efficient CRNN with cMPDR preprocessing surpasses the performance of the larger ULCNet operating on raw or Wiener-filtered inputs. These results indicate that explicitly incorporating cyclostationarity as a signal prior is more effective than increasing model capacity alone for suppressing harmonic interference.
Primary: Delft University of Technology
All Institutions: Delft University of Technology, Bang & Olufsen
This paper presents a novel hybrid framework for speech enhancement that effectively combines cyclostationarity-aware preprocessing with DNN-based denoising, showcasing significant performance improvements in low-SNR scenarios. The methodology is well-supported by rigorous experimentation, and the findings could have substantial implications for real-world applications in noisy environments.
The proposed methodology effectively integrates cyclostationarity-aware preprocessing with DNN-based denoising, utilizing a cyclic minimum power distortionless response (cMPDR) beamformer to enhance speech in low-SNR environments. This two-step approach is innovative as it leverages the unique properties of cyclostationary noise without necessitating modifications to the DNN architecture, thus maintaining a lightweight model. The choice of using both a simple convolutional recurrent neural network (CRNN) and a more complex ultra-low complexity network (ULCNet) for evaluation provides a robust comparison of the method's effectiveness across different model complexities.
The experimental evaluation is thorough, employing both synthetic and real-world datasets to demonstrate the method's effectiveness. The results consistently show significant improvements in performance metrics such as SI-SDR and DNSMOS, particularly in low-SNR conditions. The paper clearly delineates the performance gains achieved through the proposed preprocessing step, establishing a strong case for the benefits of incorporating cyclostationarity in speech enhancement tasks.
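For clarity, SI-SDR is the scale-invariant signal-to-distortion ratio; a common single-channel formulation is sketched below (mean removal and the epsilon guard are conventional choices, not specific to this paper).

```python
# Scale-invariant SDR as commonly defined; included to make the reported metric explicit.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference                      # scaled reference component of the estimate
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))
```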
The paper provides sufficient implementation details, including architecture specifications, training protocols, and hyperparameters, which facilitate reproducibility. The availability of the code on GitHub further enhances the potential for other researchers to replicate the study and build upon the findings.
One limitation noted is the reliance on stable noise frequencies for the cMPDR to be effective, which may not hold in all real-world scenarios. Additionally, the method's performance on non-cyclostationary noise types could be less effective, as indicated by the results on the DNS dataset.
The proposed approach has significant implications for applications in industrial environments where effective speech communication is crucial amidst high levels of noise. By improving speech enhancement technologies, this work could enhance the usability of hearing aids and communication devices in challenging acoustic conditions, potentially benefiting a wide range of users.
We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on Librispeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
Primary: unknown
All Institutions: unknown
This paper presents a decoder-only Conformer architecture that effectively integrates modality-aware sparse mixtures of experts for automatic speech recognition. The innovative approach and solid experimental results position it as a valuable contribution to the field, although further work is needed to enhance reproducibility and address practical deployment challenges.
The paper introduces a novel decoder-only Conformer architecture that integrates modality-aware sparse mixtures of experts (MoE) for automatic speech recognition (ASR). The methodology is well-structured, leveraging a single stack to process both speech and text without the need for external encoders or pretrained models. The use of disjoint expert pools for speech and text, along with hard routing and top-1 selection, is innovative and addresses the challenge of heterogeneous modality integration effectively. The hybrid causality approach is also a significant contribution, allowing for bidirectional processing of speech while maintaining causal generation for text. However, the paper could benefit from a more detailed explanation of the routing mechanism and its implications on model performance.
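The sketch below illustrates the modality-aware hard-routing idea in isolation: speech and text positions are dispatched to disjoint expert pools and each token is handled by exactly one expert. Expert sizes, the per-token loop, and the surrounding hybrid-causality Conformer block are simplifications and assumptions, not the authors' exact module.

```python
# Hedged sketch of modality-aware top-1 routing with disjoint expert pools.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dim, n_speech_experts=4, n_text_experts=4, hidden=2048):
        super().__init__()
        make_expert = lambda: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.speech_experts = nn.ModuleList(make_expert() for _ in range(n_speech_experts))
        self.text_experts = nn.ModuleList(make_expert() for _ in range(n_text_experts))
        self.speech_router = nn.Linear(dim, n_speech_experts)
        self.text_router = nn.Linear(dim, n_text_experts)

    def forward(self, x, is_speech):                    # x: (tokens, dim); is_speech: (tokens,) bool
        out = torch.zeros_like(x)
        for mask, experts, router in ((is_speech, self.speech_experts, self.speech_router),
                                      (~is_speech, self.text_experts, self.text_router)):
            tokens = x[mask]
            if tokens.numel() == 0:
                continue
            choice = router(tokens).argmax(dim=-1)      # hard top-1 selection per token
            out[mask] = torch.stack([experts[c](t) for c, t in zip(choice.tolist(), tokens)])
        return out
```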
The experiments are robust, demonstrating consistent improvements in word error rates (WER) over strong baselines across multiple datasets, including Librispeech and Common Voice 16.1. The results validate the proposed model's effectiveness, showing that it can outperform traditional encoder-decoder architectures while maintaining a lower parameter count. The comparative analysis against various baselines is thorough, but additional ablation studies could further clarify the contributions of individual components, such as the modality-aware routing and load-balancing loss.
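The training objective stated in the abstract, CTC on speech positions combined with label-smoothed cross-entropy on text positions, can be written compactly as below; the 0.3/0.7 weighting and the tensor shapes are assumptions for illustration.

```python
# Illustrative joint objective: CTC over speech positions plus label-smoothed
# cross-entropy over text positions; weighting and shapes are assumptions.
import torch.nn as nn
import torch.nn.functional as F

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def joint_loss(speech_logits, input_lens, ctc_targets, ctc_target_lens,
               text_logits, text_targets, ctc_weight=0.3):
    log_probs = F.log_softmax(speech_logits, dim=-1)            # (T, batch, vocab)
    l_ctc = ctc_loss(log_probs, ctc_targets, input_lens, ctc_target_lens)
    # text_logits: (batch, U, vocab); text_targets: (batch, U) next-token labels.
    l_ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets, label_smoothing=0.1)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_ce
```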
The paper provides sufficient implementation details, including model configurations, training epochs, and data augmentation techniques, which facilitate reproducibility. However, the absence of a publicly available code repository or demo limits the ability for other researchers to replicate the results independently. Including a link to the code would significantly enhance the paper's reproducibility.
While the proposed model shows promising results, it relies on a relatively complex architecture that may pose challenges in practical deployment scenarios, especially in real-time applications. Additionally, the paper does not address the potential computational overhead introduced by the MoE mechanism, which may affect inference speed. Future work should also consider the scalability of the model to larger datasets and more diverse languages.
The research has significant implications for the field of automatic speech recognition, particularly in unifying speech and text processing within a single framework. This could lead to more efficient and effective ASR systems, especially in multilingual contexts. The approach may also inspire further research into modality-aware architectures in other domains, such as natural language processing and computer vision.
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject and are limited by the spatial sampling theorem, resulting in significant performance degradation under sparse sampling. Recent learning-based methods alleviate this limitation by leveraging cross-subject information, yet most existing neural architectures primarily focus on modeling spatial relationships across directions, while spectral dependencies along the frequency dimension are often modeled implicitly or treated independently. However, HRTF magnitude responses exhibit strong local continuity and long-range structure in the frequency domain, which are not fully exploited. This work investigates frequency-domain feature modeling by examining how different architectural choices, ranging from per-frequency multilayer perceptrons to convolutional, dilated convolutional, and attention-based models, affect performance under varying sparsity levels, showing that explicit spectral modeling consistently improves reconstruction accuracy, particularly under severe sparsity. Motivated by this observation, a frequency-domain Conformer-based architecture is adopted to jointly capture local spectral continuity and long-range frequency correlations. Experimental results on the SONICOM and HUTUBS datasets demonstrate that the proposed method achieves state-of-the-art performance in terms of interaural level difference and log-spectral distortion.
Primary: University of Technology Sydney
All Institutions: University of Technology Sydney, Monash University
This paper makes a substantial contribution to the field of audio processing by introducing a frequency-domain modeling approach for HRTF magnitude upsampling, demonstrating its effectiveness through rigorous experimentation and analysis. The findings highlight the importance of architectural choices in modeling spectral features, paving the way for future innovations in personalized audio rendering.
The paper proposes a novel approach to HRTF magnitude upsampling through frequency-domain feature modeling. It critically examines various architectural choices, including per-frequency MLPs, convolutional models, and a Conformer-based architecture, to effectively capture both local spectral continuity and long-range frequency correlations. The methodology is well-structured, with a clear separation between spatial mapping and frequency-domain modeling, which allows for a comprehensive exploration of the design space. The integration of spectral gradient loss alongside log-spectral distortion as a training objective is a thoughtful addition that enhances the model's ability to preserve spectral features.
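A minimal sketch of such a training objective is given below, combining log-spectral distortion with a first-difference (spectral gradient) penalty along frequency; the weighting `alpha` and the magnitude floor are assumptions, not the paper's exact settings.

```python
# Sketch of an LSD-plus-spectral-gradient objective; `alpha` and `eps` are assumed values.
import torch

def lsd_with_gradient_loss(pred_mag, true_mag, alpha=0.5, eps=1e-6):
    # pred_mag, true_mag: (batch, directions, freq_bins) linear-magnitude HRTFs.
    pred_db = 20.0 * torch.log10(pred_mag.clamp(min=eps))
    true_db = 20.0 * torch.log10(true_mag.clamp(min=eps))
    lsd = torch.sqrt(((pred_db - true_db) ** 2).mean(dim=-1)).mean()
    grad_err = (torch.diff(pred_db, dim=-1) - torch.diff(true_db, dim=-1)).abs().mean()
    return lsd + alpha * grad_err
```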
The experiments are robust, utilizing two well-established datasets (SONICOM and HUTUBS) to evaluate the proposed method's performance under varying sparsity levels. The results demonstrate that the FD-Conformer consistently outperforms existing methods in terms of interaural level difference (ILD) and log-spectral distortion (LSD), particularly in sparse measurement scenarios. The ablation studies provide valuable insights into the contributions of different components of the architecture, reinforcing the importance of frequency-domain modeling.
The paper includes sufficient details regarding the experimental setup, including the datasets used, preprocessing steps, model architecture, and training protocols. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings.
While the proposed method shows significant improvements, it may still be sensitive to the choice of hyperparameters and the specific configurations of the datasets used. Additionally, the performance in extremely sparse scenarios, while improved, may still not meet practical requirements for all applications, indicating a potential area for further research.
The advancements in HRTF upsampling have significant implications for personalized spatial audio rendering, which is increasingly relevant in virtual reality, gaming, and immersive audio applications. By improving the accuracy of HRTF estimations from sparse measurements, this research could enhance user experiences in various audio applications, making spatial audio more accessible and effective.
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
Primary: unknown
All Institutions: unknown
The paper presents SLD-L2S, a novel framework for high-fidelity lip-to-speech synthesis that leverages a hierarchical subspace latent diffusion model, achieving state-of-the-art results in synthesis quality. The methodology is innovative and addresses critical challenges in the field, while the experimental evaluation supports its effectiveness, though the lack of a publicly available implementation may hinder reproducibility.
The paper introduces a novel framework, SLD-L2S, which employs a hierarchical subspace latent diffusion model to directly map visual lip movements to the latent space of a pre-trained audio codec. The methodology is innovative in its use of diffusion convolution blocks (DiCB) and a reparameterized flow matching technique, which enhances the model's ability to generate high-fidelity speech without relying on traditional intermediate representations like mel-spectrograms. The hierarchical architecture and subspace decomposition approach are well-justified, addressing the inherent challenges of lip-to-speech synthesis effectively.
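For orientation, a generic conditional flow-matching objective over codec latents is sketched below: interpolate between noise and the target latent and regress the constant velocity. The paper's reparameterized variant and its additional SLM and semantic losses are not reproduced here.

```python
# Generic (rectified) flow-matching loss for illustration; not the paper's exact objective.
import torch

def flow_matching_loss(model, z1, visual_cond):
    # z1: (batch, T, dim) target codec latents; visual_cond: lip-movement features.
    z0 = torch.randn_like(z1)                                  # noise endpoint
    t = torch.rand(z1.shape[0], 1, 1, device=z1.device)        # one time sample per example
    zt = (1.0 - t) * z0 + t * z1                               # linear interpolation path
    target_velocity = z1 - z0                                  # constant velocity along that path
    pred_velocity = model(zt, t.squeeze(-1).squeeze(-1), visual_cond)
    return ((pred_velocity - target_velocity) ** 2).mean()
```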
The experiments are robust, utilizing multiple benchmark datasets (LRS3-TED and LRS2-BBC) to validate the performance of the proposed method. The results demonstrate that SLD-L2S achieves state-of-the-art performance in both objective and subjective evaluations, significantly outperforming existing methods. The use of comprehensive metrics, including UTMOS, SCOREQ, WER, and subjective MOS tests, provides a well-rounded assessment of the model's capabilities.
The paper provides detailed implementation details, including architecture configurations, training procedures, and hyperparameter settings, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the results.
One notable limitation is the lack of a clear discussion on the potential computational costs associated with the proposed method, particularly in real-world applications. Additionally, the paper does not address the scalability of the model to different languages or accents, which could impact its generalizability.
The proposed SLD-L2S framework has significant implications for various applications, including automated video dubbing, assistive technologies for individuals with speech impairments, and enhancing communication in noisy environments. By improving the quality and intelligibility of synthesized speech from visual inputs, this work could facilitate more natural interactions in human-computer interfaces.
Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.
Primary: University of California
All Institutions: University of California, The University of Queensland
The main contribution of this paper is the introduction of AudioRouter, a reinforcement learning framework that enhances audio understanding in large audio language models by optimizing tool usage while significantly reducing the amount of required training data. This innovative approach not only improves performance but also offers a scalable alternative to traditional data-intensive training methods, marking a significant advancement in the field of audio processing and reasoning.
The methodology presented in the paper is innovative as it decouples tool usage from the reasoning model, allowing for a more efficient learning process. The use of reinforcement learning to optimize a routing policy for tool invocation is a significant departure from traditional end-to-end training approaches. The authors effectively formulate tool usage as a discrete decision-making problem, which is a novel perspective in the context of audio language models. The decision to keep the reasoning model frozen while training the router is a strategic choice that enhances data efficiency and reduces complexity.
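The decoupled-routing idea can be pictured as a small policy trained with a REINFORCE-style update on answer correctness while the reasoning model stays frozen, as in the hedged sketch below; the feature input, reward definition, and action set are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a lightweight tool router trained with a policy-gradient update;
# tool set, reward, and router inputs are illustrative assumptions.
import torch
import torch.nn as nn

class ToolRouter(nn.Module):
    def __init__(self, feat_dim, num_actions):                # actions: tools plus a "no tool" option
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))

    def forward(self, features):
        return torch.distributions.Categorical(logits=self.policy(features))

def router_step(router, optimizer, features, run_with_tool, baseline=0.0):
    dist = router(features)
    action = dist.sample()
    reward = run_with_tool(action.item())                     # e.g., 1.0 if the frozen LALM answers correctly
    loss = -(reward - baseline) * dist.log_prob(action)       # REINFORCE with an optional baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```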
The experimental evaluation is robust, demonstrating the effectiveness of AudioRouter across multiple benchmarks (MMAU-mini and MMAR). The results indicate substantial improvements in performance while requiring significantly less training data compared to conventional methods. The paper provides clear comparisons against baseline models, showcasing the advantages of the proposed framework. However, the experiments could benefit from a broader range of datasets and tasks to further validate the generalizability of the approach.
The paper includes sufficient details regarding the experimental setup, including model architectures, training data, and reinforcement learning specifics. However, the lack of URLs for code or project repositories limits the reproducibility of the results. Providing access to the trained models or implementation would enhance the ability of other researchers to replicate the findings.
The paper acknowledges that the relative outcome reward relies on a fixed reasoning model, which may limit the Router's learning signal. Additionally, the focus on short-form, closed-set audio reasoning tasks with a limited set of audio tools may restrict the applicability of the findings. Future work should explore extending the framework to more complex reasoning tasks and diverse tool capabilities.
The proposed AudioRouter framework has the potential to significantly advance the field of audio understanding by providing a more data-efficient method for leveraging external tools. This approach could lead to broader applications in various domains, including audio analysis, multimedia processing, and interactive AI systems. By reducing the reliance on large annotated datasets, it may also democratize access to advanced audio processing capabilities.
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
Primary: Fudan University
All Institutions: Fudan University, MOSI Intelligence, Shanghai Innovation Institute
The paper presents MOSS-Audio-Tokenizer, a novel end-to-end audio tokenizer that significantly improves audio processing capabilities for autoregressive models. Its comprehensive methodology and robust experimental validation establish it as a noteworthy contribution to the field of machine learning and audio processing.
The paper introduces the Causal Audio Tokenizer (CAT), a novel architecture that employs a fully end-to-end approach to audio tokenization using a homogeneous stack of causal Transformer blocks. This design minimizes fixed inductive biases, allowing for high-fidelity audio reconstruction across diverse domains. The architecture's simplicity and scalability are emphasized, with joint optimization of the encoder, quantizer, decoder, and discriminator, which is a significant departure from existing methods that often rely on pretrained components or complex architectures. The methodology is well-structured, with clear explanations of the training objectives and the integration of semantic modeling through audio-to-text tasks.
The authors conduct extensive experiments to evaluate the performance of MOSS-Audio-Tokenizer against existing audio tokenizers across various bitrate regimes. The results demonstrate state-of-the-art reconstruction quality in speech, sound, and music, with a clear advantage in low-bitrate scenarios. The use of both objective and subjective evaluation metrics strengthens the findings, providing a comprehensive assessment of the model's capabilities. The experiments are well-designed, showcasing the effectiveness of the proposed Progressive Sequence Dropout training strategy and the model's robustness across different conditions.
The paper provides detailed implementation information, including architecture specifications, training schedules, and optimization strategies. However, it lacks a publicly accessible code repository or demo URL, which could hinder reproducibility. The absence of shared code or datasets limits the ability for other researchers to validate the findings independently.
While the paper presents a strong technical contribution, it does not sufficiently address potential limitations, such as the computational resources required for training the large-scale model and the generalizability of the results to real-world applications. Additionally, the reliance on a large dataset for training may not be feasible for all researchers.
The development of MOSS-Audio-Tokenizer has significant implications for the field of audio processing and generation, particularly in enhancing the capabilities of autoregressive models. Its ability to provide high-fidelity audio reconstruction and support various downstream tasks like text-to-speech and automatic speech recognition positions it as a valuable tool for future audio foundation models. The research could lead to advancements in applications such as virtual assistants, content creation, and accessibility technologies.
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
Primary: Fudan University
All Institutions: Fudan University, MOSI Intelligence, Shanghai Innovation Institute
The paper introduces MOSS-Audio-Tokenizer, a scalable and effective audio tokenizer that leverages a fully end-to-end Transformer architecture to achieve high-fidelity audio reconstruction and competitive performance in downstream tasks. This work represents a significant advancement in audio processing methodologies, emphasizing the importance of simplicity and scalability in model design.
The paper presents a novel architecture, MOSS-Audio-Tokenizer, built on the Causal Audio Tokenizer (CAT) framework, which utilizes a purely Transformer-based approach for audio tokenization. This end-to-end model optimizes the encoder, quantizer, and decoder jointly, which is a significant departure from existing methods that rely on pretrained encoders or complex architectures. The use of residual vector quantization and a multi-task learning strategy to align audio representations with text further enhances the methodology. The design principles emphasize simplicity, scalability, and causality, making it suitable for autoregressive modeling.
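As background, residual vector quantization discretizes a latent by repeatedly quantizing the residual left by previous codebooks; a minimal sketch is shown below (commitment losses, straight-through gradients, and codebook updates are omitted, and the exact quantizer in MOSS-Audio-Tokenizer may differ).

```python
# Minimal residual vector quantization, to make the quantizer stage concrete.
import torch

def residual_vector_quantize(x, codebooks):
    # x: (batch, T, dim); codebooks: list of (codebook_size, dim) tensors.
    residual, quantized, codes = x, torch.zeros_like(x), []
    for codebook in codebooks:
        # Nearest codeword for the current residual at every time step.
        dists = torch.cdist(residual, codebook.unsqueeze(0).expand(x.shape[0], -1, -1))
        idx = dists.argmin(dim=-1)
        selected = codebook[idx]
        quantized = quantized + selected
        residual = residual - selected
        codes.append(idx)
    return quantized, torch.stack(codes, dim=-1)              # one code per codebook per frame
```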
The experiments are comprehensive, evaluating the model across various audio domains including speech, sound, and music. The authors provide both objective and subjective metrics for reconstruction quality, demonstrating that MOSS-Audio-Tokenizer consistently outperforms existing codecs across different bitrates. The results indicate a clear advantage in reconstruction fidelity and robustness, particularly in low-bitrate scenarios, showcasing the effectiveness of the proposed architecture.
The paper includes detailed implementation specifics, including architecture configurations, training schedules, and optimization strategies. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. The authors do mention training on a substantial dataset (3 million hours of audio), but without access to the code or data, independent verification of results could be challenging.
One limitation is the reliance on a large-scale dataset for training, which may not be readily available to all researchers. Additionally, while the model shows strong performance across various tasks, the scalability of the architecture in real-world applications and its performance in edge cases or less common audio types remains to be fully explored.
The MOSS-Audio-Tokenizer has the potential to significantly advance the field of audio processing by providing a unified framework for audio generation and understanding. Its applications could extend to various domains including speech synthesis, automatic speech recognition, and audio content generation, making it a valuable tool for both researchers and practitioners in the field.
Standardized laboratory characterizations for absorbing materials rely on idealized sound field assumptions, which deviate substantially from real-life conditions. Consequently, in-situ acoustic characterization has become essential for accurate diagnosis and virtual prototyping. We propose a physics-informed neural field that reconstructs local, near-surface broadband sound fields from sparse pressure samples to directly infer complex surface impedance. A parallel, multi-frequency architecture enables broadband impedance retrieval within runtimes on the order of seconds to minutes. To validate the method, we developed a compact microphone array with low hardware complexity. Numerical verifications and laboratory experiments demonstrate accurate impedance retrieval with a small number of sensors under realistic conditions. We further showcase the approach in a vehicle cabin to provide practical guidance on measurement locations that avoid strong interference. Here, we show that this approach offers a robust means of characterizing in-situ boundary conditions for architectural and automotive acoustics.
Primary: Technical University of Denmark
All Institutions: Technical University of Denmark
The main contribution of this paper is the development of a physics-informed neural network framework for rapid in-situ characterization of surface impedance from sparse acoustic data, which significantly advances the state-of-the-art in acoustic material characterization. The methodology combines innovative neural network architecture with practical experimental validation, addressing critical challenges in the field of acoustics.
The paper introduces a novel physics-informed neural network architecture for inferring surface impedance from sparse acoustic data, which is a significant advancement over traditional methods that rely on dense sensor arrays and idealized conditions. The use of a parallel multi-frequency architecture allows for efficient processing and inference, addressing computational bottlenecks associated with broadband sound field reconstruction. The methodology is well-structured, incorporating automatic differentiation to infer particle velocity, and employs a composite loss function that integrates data fidelity, physical constraints, and regularization terms, which enhances the robustness of the model.
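To make the described loss structure concrete, the following minimal sketch shows how such a composite objective could be assembled for one frequency band of a parallel architecture. The network size, loss weights, collocation strategy, and the sign convention used in Euler's equation are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a physics-informed sound-field model for impedance retrieval.
# Network size, loss weights, and conventions are illustrative assumptions.
import torch
import torch.nn as nn

RHO, C = 1.21, 343.0                 # air density [kg/m^3], speed of sound [m/s]
FREQ = 500.0                         # one frequency band of the parallel architecture
OMEGA = 2 * torch.pi * FREQ
K = OMEGA / C                        # wavenumber

class PressureField(nn.Module):
    """Maps a 3-D coordinate to complex pressure (real, imag) at one frequency."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),
        )

    def forward(self, xyz):
        out = self.net(xyz)
        return torch.complex(out[:, 0], out[:, 1])

def field_and_derivatives(model, xyz):
    """Complex pressure, its spatial gradient, and Laplacian via automatic differentiation."""
    xyz = xyz.clone().requires_grad_(True)
    p = model(xyz)
    grads, lap = [], 0.0
    for i, part in enumerate((p.real, p.imag)):
        g = torch.autograd.grad(part.sum(), xyz, create_graph=True)[0]
        grads.append(g)
        for d in range(3):
            g2 = torch.autograd.grad(g[:, d].sum(), xyz, create_graph=True)[0][:, d]
            lap = lap + (g2 if i == 0 else 1j * g2)
    return p, torch.complex(grads[0], grads[1]), lap

def composite_loss(model, mic_xyz, mic_p, colloc_xyz, w_pde=1e-2, w_reg=1e-4):
    # Data fidelity at the sparse microphone positions
    loss_data = (model(mic_xyz) - mic_p).abs().pow(2).mean()
    # Physics constraint: Helmholtz residual lap(p) + k^2 p = 0 at collocation points
    p_c, _, lap = field_and_derivatives(model, colloc_xyz)
    loss_pde = (lap + K ** 2 * p_c).abs().pow(2).mean()
    # Simple weight-decay regularization on the network parameters
    loss_reg = sum(w.pow(2).sum() for w in model.parameters())
    return loss_data + w_pde * loss_pde + w_reg * loss_reg

def surface_impedance(model, surf_xyz, normal=(0.0, 0.0, 1.0)):
    """Z = p / v_n, with v = -grad(p) / (i*omega*rho); the sign depends on the time convention."""
    p, grad_p, _ = field_and_derivatives(model, surf_xyz)
    n = torch.tensor(normal, dtype=grad_p.dtype)
    v_n = -(grad_p * n).sum(dim=1) / (1j * OMEGA * RHO)
    return (p / v_n).mean()
```

In a sketch of this kind the impedance follows directly from the reconstructed field, with the normal particle velocity obtained through automatic differentiation rather than an additional velocity sensor.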
The experimental validation is thorough, encompassing both numerical simulations and laboratory experiments in anechoic and reverberant environments. The results demonstrate the framework's capability to accurately retrieve impedance under realistic conditions, showcasing its practical applicability in complex acoustic environments such as vehicle cabins. The sensitivity analysis and parametric sweeps provide valuable insights into the performance of the proposed microphone array configurations, further reinforcing the robustness of the method.
The paper provides detailed descriptions of the experimental setups, training protocols, and evaluation metrics, which facilitate reproducibility. However, the lack of publicly available code and data at this stage may hinder independent validation of the results. The authors mention plans to establish a public repository upon acceptance, which would enhance reproducibility.
One limitation noted is the sensitivity of the method to local sound field complexity, particularly in the presence of strong nodal lines and reflections, which can degrade inference accuracy. Additionally, the reliance on specific microphone configurations may limit the generalizability of the findings to other setups or environments. The paper also acknowledges the challenges posed by measurement noise, especially in the context of near-rigid surfaces.
The proposed framework has significant implications for in-situ acoustic characterization in various fields, including architectural acoustics and automotive design. By enabling rapid and accurate impedance retrieval, this method can improve the design and optimization of sound-absorbing materials and structures, ultimately enhancing acoustic performance in real-world applications. The integration of machine learning with physics-informed approaches represents a promising direction for future research in acoustic engineering.
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
Primary: Institut national de la recherche scientifique (INRS - EMT)
All Institutions: Institut national de la recherche scientifique (INRS - EMT)
The main contribution of this paper is the introduction of BioME, a resource-efficient audio encoder designed for bioacoustic applications, which achieves state-of-the-art performance while significantly reducing computational costs. This work represents a meaningful advancement in the field of audio representation learning, particularly in the context of ecological monitoring, and demonstrates the potential of integrating traditional signal processing techniques with modern deep learning approaches.
The methodology presented in this paper is robust and innovative, leveraging layer-to-layer knowledge distillation to create a compact audio encoder, BioME, that retains high performance on bioacoustic tasks. The integration of modulation-aware features via FiLM conditioning is particularly noteworthy, as it introduces a novel inductive bias that enhances feature disentanglement, which is crucial for effective audio representation in resource-constrained environments. The use of a multi-domain pretraining strategy further strengthens the model's generalization capabilities across diverse bioacoustic tasks.
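As a rough illustration of the two mechanisms highlighted above, the sketch below combines a FiLM layer that modulates student features with a conditioning vector (standing in for the modulation-aware acoustic features) and a layer-to-layer distillation loss that aligns selected student layers to frozen teacher layers through learned projections. The layer pairing, dimensions, and module names are hypothetical and do not reproduce BioME's released code.

```python
# Illustrative sketch: FiLM conditioning on DSP-derived features plus layer-to-layer distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift hidden features with a conditioning vector."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, feat_dim)
        self.to_beta = nn.Linear(cond_dim, feat_dim)

    def forward(self, h, cond):
        # h: (batch, time, feat_dim); cond: (batch, cond_dim), e.g. modulation-spectrum features
        gamma = self.to_gamma(cond).unsqueeze(1)
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * h + beta

def layer_to_layer_distill_loss(student_feats, teacher_feats, projections):
    """Match each selected student layer to a teacher layer through a learned projection.

    student_feats / teacher_feats: lists of (batch, time, dim) activations of equal length.
    projections: nn.ModuleList of linear maps from student dims to the teacher dim.
    """
    loss = 0.0
    for s, t, proj in zip(student_feats, teacher_feats, projections):
        loss = loss + F.mse_loss(proj(s), t.detach())   # teacher activations are frozen
    return loss / len(projections)

if __name__ == "__main__":
    # Made-up dimensions: a 4-layer student distilled against 4 selected teacher layers.
    B, T = 2, 100
    student_dims, teacher_dim, cond_dim = [256] * 4, 768, 64
    film = FiLM(cond_dim, student_dims[0])
    projections = nn.ModuleList(nn.Linear(d, teacher_dim) for d in student_dims)
    cond = torch.randn(B, cond_dim)                     # stand-in for modulation-aware features
    student_feats = [film(torch.randn(B, T, d), cond) for d in student_dims]
    teacher_feats = [torch.randn(B, T, teacher_dim) for _ in student_dims]
    print(float(layer_to_layer_distill_loss(student_feats, teacher_feats, projections)))
```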
The experimental evaluation is thorough, utilizing a variety of datasets and benchmarks, including the BEANS benchmark for bioacoustic tasks. The results demonstrate that BioME outperforms larger models, including its teacher model, in several scenarios, particularly in resource-constrained setups. The ablation studies provide clear insights into the contributions of different architectural components, validating the effectiveness of the proposed modifications and confirming the model's robustness across various tasks.
The authors have made efforts to ensure reproducibility by providing code and pretrained checkpoints publicly. However, the paper does not include specific URLs for these resources, which would make them easier to locate. Detailed descriptions of the datasets and training procedures are included, which aids in replicating the experiments.
One limitation is the potential overfitting observed in larger model configurations, particularly in specific tasks like binary classification for beehive monitoring. Additionally, while the model shows promise, the paper does not extensively discuss the trade-offs between model size and performance in all contexts, which could be important for practical applications.
The implications of this work are significant for ecological monitoring and conservation efforts, as it enables efficient and effective bioacoustic monitoring using resource-constrained IoT devices. The advancements in self-supervised learning for audio representation can also influence broader applications in machine learning, particularly in fields requiring real-time audio processing and analysis.
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference-time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating full-mix generation by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
Primary: Adobe Research
All Institutions: Adobe Research
The paper introduces Stemphonic, a novel framework for efficient multi-stem music generation, significantly advancing the field of audio generation through innovative methodologies and promising experimental results.
The methodology presents a novel diffusion-/flow-based framework for music stem generation that addresses the limitations of existing approaches by allowing variable, synchronized stem outputs in a single inference pass. The introduction of techniques such as stem grouping and noise sharing during training is particularly innovative, as it enhances inter-stem cohesion and synchronization, which are critical in music generation tasks. The approach is well-structured and builds upon established generative models, showcasing a clear progression from theory to practical application.
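A minimal sketch of this grouping idea is shown below: during training each stem is a separate batch element, stems from the same song share one noise sample, and at inference a single initial latent is repeated across stems with stem-specific text conditioning. The denoiser interface, latent shapes, and Euler update are placeholders rather than Stemphonic's actual sampler.

```python
# Sketch of shared-noise grouping for synchronized multi-stem generation (illustrative only).
import torch

def build_shared_noise(stem_latents, group_ids, noise=None):
    """stem_latents: (N, C, T) latents, one per stem; group_ids: (N,) song index per stem.
    Stems with the same group id receive the same noise sample."""
    N, C, T = stem_latents.shape
    unique_ids = group_ids.unique()
    if noise is None:
        noise = torch.randn(len(unique_ids), C, T)
    per_stem_noise = torch.empty_like(stem_latents)
    for i, gid in enumerate(unique_ids):
        per_stem_noise[group_ids == gid] = noise[i]   # broadcast one (C, T) sample per group
    return per_stem_noise

@torch.no_grad()
def generate_synchronized_stems(denoiser, text_embeds, C, T, steps=50):
    """One-pass multi-stem inference: one shared initial noise latent, stem-specific text.
    `denoiser(x, t, text)` is a stand-in for any diffusion/flow network."""
    n_stems = text_embeds.shape[0]
    x = torch.randn(1, C, T).repeat(n_stems, 1, 1)    # shared initial noise across all stems
    dt = 1.0 / steps
    for step in range(steps):                         # flow-matching convention: t=0 noise, t=1 data
        t = torch.full((n_stems,), step * dt)
        x = x + dt * denoiser(x, t, text_embeds)      # plain Euler step along the velocity field
    return x                                          # (n_stems, C, T) synchronized latents
```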
The experiments are comprehensive, utilizing multiple datasets and evaluation metrics to assess the quality of generated stems and mixes. The results demonstrate significant improvements in generation quality and efficiency, with quantitative metrics such as Fréchet Audio Distance (FAD) providing a robust framework for evaluation. The ablation studies effectively highlight the contributions of the proposed techniques, reinforcing the validity of the claims made by the authors.
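For reference, FAD is the Fréchet distance between Gaussians fitted to embeddings of reference and generated audio. Assuming the embeddings have already been extracted with some audio encoder (e.g., a VGGish- or CLAP-style model; the choice is not fixed here), the metric can be computed as follows.

```python
# Fréchet Audio Distance between two sets of audio embeddings; embedding extraction is omitted.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: (num_clips, dim) embedding matrices."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```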
The paper provides detailed implementation specifics, including architecture choices, training procedures, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on specific datasets for training and evaluation, which may not fully capture the diversity of music styles and genres. Additionally, while the model shows promise in generating synchronized stems, the quality of generated audio may still vary depending on the complexity of the input prompts and conditions.
The proposed framework has significant implications for music production, enabling greater creative control for musicians and content creators. By facilitating the generation of isolated instrument tracks, it can streamline workflows in music composition and production, potentially democratizing music creation for non-experts. The ability to generate stems on-demand could also enhance collaborative efforts in music-making.
In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially stronger performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its robustness in practical use. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Covo-Audio, a novel end-to-end LALM that effectively integrates audio processing and semantic reasoning, demonstrating strong performance across various tasks. This work represents a significant advancement in the field of audio machine learning, particularly in its approach to conversational systems and dialogue intelligence.
The methodology presented in Covo-Audio is innovative as it integrates a large-scale end-to-end LALM capable of processing continuous audio inputs and generating audio outputs. The architecture is designed for various tasks, including speech-text modeling and full-duplex voice interaction, which demonstrates a comprehensive approach to audio processing. The intelligence-speaker decoupling strategy is particularly noteworthy as it allows for flexible voice customization while maintaining dialogue performance, showcasing a novel approach to reducing deployment costs.
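To illustrate what such a decoupling can look like in practice, the sketch below separates a voice-agnostic dialogue core from a per-voice renderer; the class and method names are hypothetical, since the paper does not publish this interface, and the stub bodies only stand in for the actual models.

```python
# Conceptual sketch of an intelligence-speaker decoupling interface (hypothetical names).
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    text: str                          # or speaker-agnostic semantic tokens
    prosody_hints: dict | None = None

class DialogueCore:
    """Handles understanding, reasoning, and response planning; knows nothing about voices."""
    def respond(self, user_input: str) -> DialogueTurn:
        # Stand-in for the end-to-end model; unchanged when the voice is swapped.
        return DialogueTurn(text=f"(reply to: {user_input})")

class VoiceRenderer:
    """Small TTS-style module; adapted per voice with minimal speaker-specific data."""
    def __init__(self, voice_id: str):
        self.voice_id = voice_id

    def render(self, turn: DialogueTurn) -> bytes:
        # Stand-in for waveform synthesis in the target voice.
        return f"[{self.voice_id}] {turn.text}".encode()

def converse(core: DialogueCore, renderer: VoiceRenderer, user_input: str) -> bytes:
    turn = core.respond(user_input)    # dialogue intelligence, voice-independent
    return renderer.render(turn)       # voice customization isolated to this stage

if __name__ == "__main__":
    print(converse(DialogueCore(), VoiceRenderer("voice_a"), "hello"))
```

The point of the sketch is simply that swapping or adding a voice touches only the renderer, which is why only a small amount of TTS data is needed while the dialogue behavior is preserved.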
The experiments are extensive, covering multiple benchmarks and demonstrating strong performance against representative open-source models. The paper provides quantitative results that validate the model's capabilities in speech-text comprehension and conversational abilities. However, the paper could benefit from more detailed comparisons with existing models to better contextualize its performance.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While it mentions large-scale pretraining and post-training, the absence of code or a project URL limits the ability for other researchers to replicate the findings or build upon the work.
One limitation is the high parameter count of the model, which may hinder accessibility for researchers with limited computational resources. Additionally, while the decoupling strategy is innovative, its practical implications and potential trade-offs in performance are not thoroughly explored.
The potential applications of Covo-Audio are significant, particularly in developing more capable conversational assistants that can handle complex audio interactions. The model's ability to generate empathetic responses could enhance user experience in real-world applications, making it a valuable contribution to the field of audio processing and conversational AI.
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
Primary: unknown
All Institutions: unknown
The paper presents TVTSyn, a novel streaming voice conversion and anonymization system that effectively synchronizes speaker identity with content through a time-varying timbre representation, demonstrating significant advancements in privacy and expressivity under strict latency constraints. The methodology is innovative, and the experimental results suggest a strong potential for real-world applications, although further work is needed to address limitations and enhance reproducibility.
The proposed methodology introduces a novel time-varying timbre (TVT) representation that synchronizes speaker identity with content, addressing the static-dynamic mismatch prevalent in existing voice conversion systems. The architecture is well-structured, comprising a Global Timbre Memory (GTM) that enhances the expressivity of speaker identity while maintaining low latency, which is crucial for real-time applications. The use of a factorized vector-quantized bottleneck to regularize content and reduce speaker leakage is a significant innovation that contributes to the overall effectiveness of the system. The integration of causal convolutional networks and self-attention mechanisms demonstrates a sophisticated approach to maintaining temporal coherence in streaming scenarios.
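The sketch below illustrates one way the described mechanism could be wired together: a global timbre embedding expanded into a small facet memory, frame-level content queries attending over it, a gate controlling how far the timbre drifts per frame, and spherical interpolation keeping the result close to the identity hypersphere. Dimensions, module names, and the exact placement of the gate are assumptions, not the authors' implementation.

```python
# Illustrative sketch of a content-synchronous, time-varying timbre module.
import torch
import torch.nn as nn
import torch.nn.functional as F

def slerp(a, b, t):
    """Spherical interpolation between (normalized) vectors a and b with weight t in [0, 1]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    omega = torch.acos((a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

class TimeVaryingTimbre(nn.Module):
    def __init__(self, timbre_dim=192, content_dim=256, n_facets=8):
        super().__init__()
        self.expand = nn.Linear(timbre_dim, n_facets * timbre_dim)   # global -> facet memory
        self.query = nn.Linear(content_dim, timbre_dim)
        self.gate = nn.Sequential(nn.Linear(content_dim, 1), nn.Sigmoid())
        self.n_facets, self.timbre_dim = n_facets, timbre_dim

    def forward(self, content, global_timbre):
        # content: (B, T, content_dim); global_timbre: (B, timbre_dim)
        B, T, _ = content.shape
        memory = self.expand(global_timbre).view(B, self.n_facets, self.timbre_dim)
        q = self.query(content)                                      # (B, T, timbre_dim)
        attn = torch.softmax(q @ memory.transpose(1, 2) / self.timbre_dim ** 0.5, dim=-1)
        local = attn @ memory                                        # frame-wise attended timbre
        t = self.gate(content)                                       # how far to drift per frame
        base = global_timbre.unsqueeze(1).expand(-1, T, -1)
        return slerp(base, local, t)                                 # stays near the identity sphere
```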
The experiments are comprehensive, evaluating the proposed system against state-of-the-art (SOTA) methods across multiple metrics, including naturalness, speaker transfer, and anonymization effectiveness. The use of perceptual listening tests alongside objective metrics provides a well-rounded assessment of performance. The results indicate that TVTSyn achieves a favorable balance between privacy and utility, outperforming several baselines in terms of both speaker similarity and anonymization quality. However, the paper could benefit from a more detailed exploration of the datasets used and the specific configurations of the baseline models for clearer comparisons.
The paper provides a detailed account of the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ability for others to replicate the results fully. The authors mention that the model was trained on specific datasets, but more information on data preprocessing and augmentation techniques would enhance reproducibility.
One notable limitation is the reliance on a fixed number of pseudo-speakers, which may restrict the model's adaptability to diverse speaker characteristics in real-world applications. Additionally, while the model performs well under controlled conditions, its robustness in noisy or variable environments has not been thoroughly evaluated. Future work should also address the scalability of the system in terms of processing power and memory requirements, especially for deployment in resource-constrained settings.
The implications of this research are significant, particularly in the context of privacy-preserving technologies for voice communication. The ability to anonymize speaker identity while maintaining intelligibility and naturalness is crucial for applications in teleconferencing, live translation, and other real-time voice interfaces. As privacy concerns continue to grow, the development of effective voice conversion and anonymization systems like TVTSyn could play a vital role in enhancing user security and trust in voice technologies.