Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headrooms, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted on GitHub.
Primary: Google Research
All Institutions: Google Research
The paper presents the Massive Sound Embedding Benchmark (MSEB), a comprehensive framework for evaluating auditory capabilities in multimodal systems. The proposed methodology and initial experiments highlight significant opportunities for improvement in machine auditory intelligence, although further details on implementation and rigorous benchmarking against existing methods would enhance its impact.
The paper introduces the Massive Sound Embedding Benchmark (MSEB), which is a novel framework aimed at evaluating auditory capabilities in multimodal systems. The methodology is well-structured, presenting eight core tasks that cover a wide range of audio processing capabilities. The inclusion of the Simple Voice Questions (SVQ) dataset is a significant addition, as it provides a large-scale resource for benchmarking. The tasks are clearly defined, and the framework is extensible, allowing for future enhancements. However, the paper could benefit from more detailed descriptions of the specific algorithms or techniques used to generate embeddings for each task.
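To make the embedding-centric framing concrete, a retrieval-style task in such a benchmark essentially reduces to encoding audio into vectors and scoring nearest-neighbour matches. The sketch below uses random vectors and a hypothetical recall@1 helper; it is not the actual MSEB API.

```python
import numpy as np

def recall_at_1(query_embs: np.ndarray, doc_embs: np.ndarray, gold_ids: np.ndarray) -> float:
    """Fraction of queries whose nearest document (by cosine similarity) is the gold match."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    predicted = (q @ d.T).argmax(axis=1)      # best-scoring document index per query
    return float((predicted == gold_ids).mean())

# Toy usage: random vectors stand in for audio embeddings produced by any encoder under test.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 256))
queries = docs[:10] + 0.1 * rng.normal(size=(10, 256))   # queries near their gold documents
print(recall_at_1(queries, docs, np.arange(10)))          # close to 1.0 for this easy toy case
```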
The initial experiments reported in the paper establish performance benchmarks across the eight tasks, indicating clear performance headrooms. While the results are promising, the paper lacks detailed quantitative results and comparisons with existing benchmarks, which would strengthen the claims of improvement. Additionally, the experimental setup and metrics used for evaluation are not thoroughly discussed, which raises questions about the robustness of the findings.
The paper mentions that the library is publicly hosted on GitHub, which is a positive aspect for reproducibility. However, there is limited information on the specific implementation details, such as the versions of libraries used, the hardware setup, and the training procedures. This lack of detail could hinder other researchers from effectively reproducing the results.
One limitation is the potential overfitting to the benchmark tasks, as the initial experiments may not fully represent real-world scenarios. Furthermore, the paper does not address the scalability of the framework or how it performs with varying audio qualities and conditions. The reliance on a single dataset (SVQ) for initial experiments may also limit the generalizability of the findings.
The MSEB framework has the potential to significantly impact the field of machine auditory intelligence by providing a standardized way to evaluate and compare different algorithms. This could accelerate advancements in multimodal systems that rely on audio processing, with applications in areas such as human-computer interaction, accessibility technologies, and automated content generation.
Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few-shot audio and audio-visual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches, with absolute gains of up to 14.52%, 1.53%, and 8.35% for audio classification, audio-visual classification, and spoofing detection, respectively.
Primary: MIT-IBM Watson AI Lab
All Institutions: MIT-IBM Watson AI Lab, Tuebingen AI Center
The main contribution of this paper is the introduction of a class-dependent weighting mechanism for attention heads in large audio-language models, which significantly enhances their performance in few-shot classification tasks. This work represents a meaningful advancement in the field of audio processing and machine learning, addressing existing limitations in model performance and paving the way for future research in adaptive attention mechanisms.
The proposed method, Class-Conditional Sparse Attention Vectors, introduces a novel approach to weighting attention heads based on class-specific importance, which is a significant departure from previous methods that treated all heads equally. This class-dependent weighting mechanism allows for more nuanced feature extraction tailored to specific tasks, enhancing the model's performance in few-shot classification scenarios. The methodology is well-structured and builds upon existing frameworks in audio-language processing, demonstrating a clear understanding of the limitations of uniform voting schemes.
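As an illustration of what class-dependent head weighting can look like in a few-shot setting, the sketch below builds per-head class prototypes and weights each head by its support-set reliability for each class. The nearest-prototype scoring rule and all names are our assumptions, not the authors' exact procedure.

```python
import numpy as np

def fit_head_class_weights(support_feats, support_labels, n_classes):
    """support_feats: (n_support, n_heads, dim) per-attention-head features.
    support_labels: (n_support,) integer class labels.
    Returns per-head class prototypes and class-conditional head weights."""
    n_support, n_heads, dim = support_feats.shape
    protos = np.zeros((n_heads, n_classes, dim))
    for c in range(n_classes):
        protos[:, c] = support_feats[support_labels == c].mean(axis=0)   # (n_heads, dim)

    # Class-conditional reliability: how often each head's nearest prototype is correct
    # on support examples of that class (a simple stand-in for learned importance weights).
    weights = np.zeros((n_heads, n_classes))
    for h in range(n_heads):
        sims = support_feats[:, h] @ protos[h].T                         # (n_support, n_classes)
        pred = sims.argmax(axis=1)
        for c in range(n_classes):
            mask = support_labels == c
            weights[h, c] = (pred[mask] == c).mean()
    return protos, weights

def predict(query_feats, protos, weights):
    """query_feats: (n_heads, dim). Weighted per-head vote over classes."""
    n_heads, n_classes, _ = protos.shape
    scores = np.zeros(n_classes)
    for h in range(n_heads):
        sims = protos[h] @ query_feats[h]       # similarity of this head's features to each class
        scores += weights[h] * sims             # heads count more for classes they are reliable on
    return int(scores.argmax())
```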
The experiments conducted across various benchmarks for audio classification, audio-visual classification, and spoofing detection are robust. The reported improvements over state-of-the-art methods by notable margins (up to 14.52% in audio classification) indicate that the proposed method is not only effective but also competitive in real-world applications. However, the paper would benefit from a more detailed description of the datasets used and the specific metrics for evaluation to enhance transparency.
The paper lacks sufficient implementation details and code availability, which are critical for reproducibility. While the methodology is sound, without access to the code or a clear description of the experimental setup, it would be challenging for other researchers to replicate the results.
One limitation is the reliance on few-shot learning, which may not generalize well to all audio classification tasks, particularly those requiring extensive training data. Additionally, the paper does not address potential biases in the attention heads or the implications of class imbalance in the datasets used.
The implications of this research are significant for the development of more efficient audio-language models that can be applied in various domains, including accessibility technologies, automated content moderation, and interactive AI systems. By improving the performance of LALMs in discriminative tasks, this work could enhance user experiences in applications such as voice assistants and audio-based search engines.
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
Primary: Meta
All Institutions: Meta
The main contribution of this paper is the introduction of SiTok, a novel speech tokenizer that utilizes a diffusion autoencoder to achieve high-quality speech representation and reconstruction while maintaining low bit and token rates. This work significantly advances the field of speech processing by addressing key challenges in existing methodologies and providing a robust framework for future research and applications.
The proposed methodology of the Speech Diffusion Tokenizer (SiTok) is innovative, leveraging a diffusion autoencoder to jointly optimize quantization and reconstruction. The introduction of semantic regularization through a CTC decoder is a significant advancement, allowing the model to maintain semantic integrity while achieving high compression rates. The architecture effectively combines the strengths of diffusion models with the need for efficient speech tokenization, addressing the limitations of previous approaches that often relied on heuristic compromises. The design choices, such as the use of mel-spectrograms and the focus on low token rates, are well-justified and align with the objectives of scalable language modeling.
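For a sense of scale, the token rate and bit rate quoted in the abstract jointly fix the bits carried per token; the brief check below is straightforward arithmetic, and the codebook interpretation in the comment is our inference rather than a detail given by the authors.

```python
token_rate_hz = 12.5      # tokens generated per second of speech
bit_rate_bps = 200        # bits per second reported in the abstract

bits_per_token = bit_rate_bps / token_rate_hz
print(bits_per_token)     # 16.0 -> e.g. a single 2**16-entry codebook, or several
                          # smaller residual codebooks per frame (our inference, not stated)
```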
The experiments conducted are extensive, utilizing a large dataset of 2 million hours of speech, which enhances the robustness of the findings. The paper provides a comprehensive evaluation across various tasks, including speech reconstruction, emotion recognition, and automatic speech recognition, demonstrating that SiTok outperforms existing baselines significantly. The results are well-presented, with clear metrics for comparison, and the ablation studies effectively highlight the contributions of different components of the model.
The paper includes detailed descriptions of the model architecture, training settings, and evaluation protocols, which are crucial for reproducibility. The authors have made efforts to ensure that their work can be replicated, which is commendable. However, the absence of a publicly available code repository limits the ease of reproducibility for practitioners in the field.
While the proposed model shows promising results, it may still face challenges in real-world applications, such as the potential for overfitting due to the large number of parameters (1.6B) and the reliance on extensive training data. Additionally, the computational efficiency during inference, although improved with shortcut fine-tuning, may still be a concern for deployment in resource-constrained environments. The paper does not address the ethical implications of misuse in generating synthetic speech, which is an important consideration in today's landscape.
The development of SiTok has significant implications for speech technology, particularly in applications such as automatic speech recognition, text-to-speech systems, and conversational agents. By enabling high-fidelity audio reconstruction at low bit rates, this work could enhance accessibility and usability in various domains, including assistive technologies and real-time communication systems. The potential for misuse, such as generating deceptive synthetic speech, highlights the need for responsible deployment and monitoring of such technologies.
Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.
Primary: University of California
All Institutions: University of California, The University of Queensland
The main contribution of this paper is the introduction of AudioRouter, a reinforcement learning framework that enhances audio understanding in large audio language models by optimizing tool usage while significantly reducing the amount of required training data. This innovative approach not only improves performance but also offers a scalable alternative to traditional data-intensive training methods, marking a significant advancement in the field of audio processing and reasoning.
The methodology presented in the paper is innovative as it decouples tool usage from the reasoning model, allowing for a more efficient learning process. The use of reinforcement learning to optimize a routing policy for tool invocation is a significant departure from traditional end-to-end training approaches. The authors effectively formulate tool usage as a discrete decision-making problem, which is a novel perspective in the context of audio language models. The decision to keep the reasoning model frozen while training the router is a strategic choice that enhances data efficiency and reduces complexity.
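A minimal sketch of the kind of lightweight, decoupled router described above is given below, using a REINFORCE-style update over a two-action space (call the tool or answer directly). The state features, reward definition, and action space are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class ToolRouter(nn.Module):
    """Tiny policy head deciding whether to invoke an external audio tool.
    The frozen LALM only supplies the `state` features and the final answer."""
    def __init__(self, state_dim: int, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

router = ToolRouter(state_dim=768)
opt = torch.optim.Adam(router.parameters(), lr=1e-4)

def select_actions(states):
    """states: (B, 768) query features. Action 0 = answer directly, 1 = call the tool (assumed)."""
    dist = router(states)
    actions = dist.sample()
    return actions, dist.log_prob(actions)

def reinforce_update(log_probs, rewards):
    """rewards: (B,) outcome scores, e.g. +1 if the frozen model's final answer was correct."""
    baseline = rewards.mean()                          # simple variance-reduction baseline
    loss = -(log_probs * (rewards - baseline)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```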
The experimental evaluation is robust, demonstrating the effectiveness of AudioRouter across multiple benchmarks (MMAU-mini and MMAR). The results indicate substantial improvements in performance while requiring significantly less training data compared to conventional methods. The paper provides clear comparisons against baseline models, showcasing the advantages of the proposed framework. However, the experiments could benefit from a broader range of datasets and tasks to further validate the generalizability of the approach.
The paper includes sufficient details regarding the experimental setup, including model architectures, training data, and reinforcement learning specifics. However, the lack of URLs for code or project repositories limits the reproducibility of the results. Providing access to the trained models or implementation would enhance the ability of other researchers to replicate the findings.
The paper acknowledges that the relative outcome reward relies on a fixed reasoning model, which may limit the Router's learning signal. Additionally, the focus on short-form, closed-set audio reasoning tasks with a limited set of audio tools may restrict the applicability of the findings. Future work should explore extending the framework to more complex reasoning tasks and diverse tool capabilities.
The proposed AudioRouter framework has the potential to significantly advance the field of audio understanding by providing a more data-efficient method for leveraging external tools. This approach could lead to broader applications in various domains, including audio analysis, multimedia processing, and interactive AI systems. By reducing the reliance on large annotated datasets, it may also democratize access to advanced audio processing capabilities.
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
Primary: Fudan University
All Institutions: Fudan University, MOSI Intelligence, Shanghai Innovation Institute
The paper presents MOSS-Audio-Tokenizer, a novel end-to-end audio tokenizer that significantly improves audio processing capabilities for autoregressive models. Its comprehensive methodology and robust experimental validation establish it as a noteworthy contribution to the field of machine learning and audio processing.
The paper introduces the Causal Audio Tokenizer (CAT), a novel architecture that employs a fully end-to-end approach to audio tokenization using a homogeneous stack of causal Transformer blocks. This design minimizes fixed inductive biases, allowing for high-fidelity audio reconstruction across diverse domains. The architecture's simplicity and scalability are emphasized, with joint optimization of the encoder, quantizer, decoder, and discriminator, which is a significant departure from existing methods that often rely on pretrained components or complex architectures. The methodology is well-structured, with clear explanations of the training objectives and the integration of semantic modeling through audio-to-text tasks.
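Two ingredients emphasized above, a causal Transformer block and a jointly trained quantizer, can be sketched as follows. This is a generic straight-through vector quantizer plus a masked-attention block, not the authors' implementation; dimensions, codebook size, and the decoder/discriminator are omitted.

```python
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """One homogeneous Transformer block with a causal self-attention mask."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                              # x: (batch, time, dim)
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=causal)    # future frames are masked out
        x = x + h
        return x + self.mlp(self.ln2(x))

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                              # z: (batch, time, dim)
        w = self.codebook.weight
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.T + w.pow(2).sum(-1)
        idx = d.argmin(dim=-1)                         # discrete audio tokens
        q = self.codebook(idx)
        vq_loss = ((q - z.detach()) ** 2).mean() + 0.25 * ((q.detach() - z) ** 2).mean()
        q = z + (q - z).detach()                       # straight-through estimator
        return q, idx, vq_loss
```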
The authors conduct extensive experiments to evaluate the performance of MOSS-Audio-Tokenizer against existing audio tokenizers across various bitrate regimes. The results demonstrate state-of-the-art reconstruction quality in speech, sound, and music, with a clear advantage in low-bitrate scenarios. The use of both objective and subjective evaluation metrics strengthens the findings, providing a comprehensive assessment of the model's capabilities. The experiments are well-designed, showcasing the effectiveness of the proposed Progressive Sequence Dropout training strategy and the model's robustness across different conditions.
The paper provides detailed implementation information, including architecture specifications, training schedules, and optimization strategies. However, it lacks a publicly accessible code repository or demo URL, which could hinder reproducibility. The absence of shared code or datasets limits the ability for other researchers to validate the findings independently.
While the paper presents a strong technical contribution, it does not sufficiently address potential limitations, such as the computational resources required for training the large-scale model and the generalizability of the results to real-world applications. Additionally, the reliance on a large dataset for training may not be feasible for all researchers.
The development of MOSS-Audio-Tokenizer has significant implications for the field of audio processing and generation, particularly in enhancing the capabilities of autoregressive models. Its ability to provide high-fidelity audio reconstruction and support various downstream tasks like text-to-speech and automatic speech recognition positions it as a valuable tool for future audio foundation models. The research could lead to advancements in applications such as virtual assistants, content creation, and accessibility technologies.
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
Primary: Institut national de la recherche scientifique (INRS - EMT)
All Institutions: Institut national de la recherche scientifique (INRS - EMT)
The main contribution of this paper is the introduction of BioME, a resource-efficient audio encoder designed for bioacoustic applications, which achieves state-of-the-art performance while significantly reducing computational costs. This work represents a meaningful advancement in the field of audio representation learning, particularly in the context of ecological monitoring, and demonstrates the potential of integrating traditional signal processing techniques with modern deep learning approaches.
The methodology presented in this paper is robust and innovative, leveraging layer-to-layer knowledge distillation to create a compact audio encoder, BioME, that retains high performance on bioacoustic tasks. The integration of modulation-aware features via FiLM conditioning is particularly noteworthy, as it introduces a novel inductive bias that enhances feature disentanglement, which is crucial for effective audio representation in resource-constrained environments. The use of a multi-domain pretraining strategy further strengthens the model's generalization capabilities across diverse bioacoustic tasks.
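The two mechanisms highlighted here, FiLM conditioning on modulation features and layer-to-layer distillation, are standard enough to sketch generically. The code below assumes a frozen teacher, per-layer linear projections, and an MSE matching loss; the authors' exact layer pairing and loss weighting are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLM(nn.Module):
    """Feature-wise linear modulation: a conditioning vector (e.g. DSP-inspired
    modulation features) produces a per-channel scale and shift."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, h, cond):
        if cond.dim() == 2:                            # (B, cond_dim) -> broadcast over time
            cond = cond.unsqueeze(1)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma * h + beta                        # h: (B, T, feat_dim)

def layer_to_layer_distill_loss(student_layers, teacher_layers, projections):
    """Match each student hidden layer to a chosen teacher layer after a linear projection.
    student_layers / teacher_layers: lists of (B, T, D_s) / (B, T, D_t) tensors."""
    loss = 0.0
    for s, t, proj in zip(student_layers, teacher_layers, projections):
        loss = loss + F.mse_loss(proj(s), t.detach())  # teacher is frozen
    return loss / len(student_layers)
```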
The experimental evaluation is thorough, utilizing a variety of datasets and benchmarks, including the BEANS benchmark for bioacoustic tasks. The results demonstrate that BioME outperforms larger models, including its teacher model, in several scenarios, particularly in resource-constrained setups. The ablation studies provide clear insights into the contributions of different architectural components, validating the effectiveness of the proposed modifications and confirming the model's robustness across various tasks.
The authors have made efforts to ensure reproducibility by providing code and pretrained checkpoints publicly. However, the paper lacks specific URLs for accessing these resources, which could enhance reproducibility further. Detailed descriptions of the datasets and training procedures are included, which aids in replicating the experiments.
One limitation is the potential overfitting observed in larger model configurations, particularly in specific tasks like binary classification for beehive monitoring. Additionally, while the model shows promise, the paper does not extensively discuss the trade-offs between model size and performance in all contexts, which could be important for practical applications.
The implications of this work are significant for ecological monitoring and conservation efforts, as it enables efficient and effective bioacoustic monitoring using resource-constrained IoT devices. The advancements in self-supervised learning for audio representation can also influence broader applications in machine learning, particularly in fields requiring real-time audio processing and analysis.
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
Primary: Adobe Research
All Institutions: Adobe Research
The paper introduces Stemphonic, a novel framework for efficient multi-stem music generation, significantly advancing the field of audio generation through innovative methodologies and promising experimental results.
The methodology presents a novel framework that integrates diffusion and flow-based models for music stem generation, addressing the limitations of existing approaches by allowing for variable and synchronized stem outputs in a single inference pass. The introduction of techniques such as stem grouping and noise sharing during training is particularly innovative, as it enhances inter-stem cohesion and synchronization, which are critical in music generation tasks. The approach is well-structured and builds upon established generative models, showcasing a clear progression from theory to practical application.
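The noise-sharing idea is simple to state in code: every stem belonging to the same song (group) receives an identical initial noise latent, while text conditioning stays stem-specific. The helper below is a generic sketch with assumed tensor shapes, not the authors' implementation.

```python
import torch

def grouped_noise(group_ids: torch.Tensor, latent_shape) -> torch.Tensor:
    """Give every stem in the same song/group an identical initial noise latent.

    group_ids: (batch,) integer id of the song each stem belongs to.
    latent_shape: shape of one stem's latent, e.g. (channels, time).
    """
    unique_ids, inverse = torch.unique(group_ids, return_inverse=True)
    per_group = torch.randn(len(unique_ids), *latent_shape)   # one noise tensor per group
    return per_group[inverse]                                 # replicated across that group's stems

# Example: a batch of 4 stems from 2 songs; stems of the same song share noise.
ids = torch.tensor([0, 0, 1, 1])
noise = grouped_noise(ids, (8, 256))
assert torch.equal(noise[0], noise[1]) and not torch.equal(noise[1], noise[2])
```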
The experiments are comprehensive, utilizing multiple datasets and evaluation metrics to assess the quality of generated stems and mixes. The results demonstrate significant improvements in generation quality and efficiency, with quantitative metrics such as Fréchet Audio Distance (FAD) providing a robust framework for evaluation. The ablation studies effectively highlight the contributions of the proposed techniques, reinforcing the validity of the claims made by the authors.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on specific datasets for training and evaluation, which may not fully capture the diversity of music styles and genres. Additionally, while the model shows promise in generating synchronized stems, the quality of generated audio may still vary depending on the complexity of the input prompts and conditions.
The proposed framework has significant implications for music production, enabling greater creative control for musicians and content creators. By facilitating the generation of isolated instrument tracks, it can streamline workflows in music composition and production, potentially democratizing music creation for non-experts. The ability to generate stems on-demand could also enhance collaborative efforts in music-making.
In this work, we present Covo-Audio, a 7B-parameter end-to-end large audio-language model (LALM) that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Covo-Audio, a novel end-to-end LALM that effectively integrates audio processing and semantic reasoning, demonstrating strong performance across various tasks. This work represents a significant advancement in the field of audio machine learning, particularly in its approach to conversational systems and dialogue intelligence.
The methodology presented in Covo-Audio is innovative as it integrates a large-scale end-to-end LALM capable of processing continuous audio inputs and generating audio outputs. The architecture is designed for various tasks, including speech-text modeling and full-duplex voice interaction, which demonstrates a comprehensive approach to audio processing. The intelligence-speaker decoupling strategy is particularly noteworthy as it allows for flexible voice customization while maintaining dialogue performance, showcasing a novel approach to reducing deployment costs.
The experiments are extensive, covering multiple benchmarks and demonstrating strong performance against representative open-source models. The paper provides quantitative results that validate the model's capabilities in speech-text comprehension and conversational abilities. However, the paper could benefit from more detailed comparisons with existing models to better contextualize its performance.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While it mentions large-scale pretraining and post-training, the absence of code or a project URL limits the ability for other researchers to replicate the findings or build upon the work.
One limitation is the high parameter count of the model, which may hinder accessibility for researchers with limited computational resources. Additionally, while the decoupling strategy is innovative, its practical implications and potential trade-offs in performance are not thoroughly explored.
The potential applications of Covo-Audio are significant, particularly in developing more capable conversational assistants that can handle complex audio interactions. The model's ability to generate empathetic responses could enhance user experience in real-world applications, making it a valuable contribution to the field of audio processing and conversational AI.
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
Primary: unknown
All Institutions: unknown
The paper presents TVTSyn, a novel streaming voice conversion and anonymization system that effectively synchronizes speaker identity with content through a time-varying timbre representation, demonstrating significant advancements in privacy and expressivity under strict latency constraints. The methodology is innovative, and the experimental results suggest a strong potential for real-world applications, although further work is needed to address limitations and enhance reproducibility.
The proposed methodology introduces a novel time-varying timbre (TVT) representation that synchronizes speaker identity with content, addressing the static-dynamic mismatch prevalent in existing voice conversion systems. The architecture is well-structured, comprising a Global Timbre Memory (GTM) that enhances the expressivity of speaker identity while maintaining low latency, which is crucial for real-time applications. The use of a factorized vector-quantized bottleneck to regularize content and reduce speaker leakage is a significant innovation that contributes to the overall effectiveness of the system. The integration of causal convolutional networks and self-attention mechanisms demonstrates a sophisticated approach to maintaining temporal coherence in streaming scenarios.
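One concrete piece worth illustrating is the spherical interpolation used to preserve identity geometry. The sketch below is standard slerp between unit-norm timbre vectors; the facet memory, gating, and attention around it are omitted, and the function name is ours.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between unit-norm timbre vectors a and b.

    Keeps the interpolated embedding on the hypersphere, so the identity geometry
    is preserved while allowing smooth local variation. t may be a scalar or a
    per-frame tensor broadcastable against a and b.
    """
    a = a / a.norm(dim=-1, keepdim=True)
    b = b / b.norm(dim=-1, keepdim=True)
    cos = (a * b).sum(dim=-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    omega = torch.acos(cos)                            # angle between the two vectors
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / sin_omega
```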
The experiments are comprehensive, evaluating the proposed system against state-of-the-art (SOTA) methods across multiple metrics, including naturalness, speaker transfer, and anonymization effectiveness. The use of perceptual listening tests alongside objective metrics provides a well-rounded assessment of performance. The results indicate that TVTSyn achieves a favorable balance between privacy and utility, outperforming several baselines in terms of both speaker similarity and anonymization quality. However, the paper could benefit from a more detailed exploration of the datasets used and the specific configurations of the baseline models for clearer comparisons.
The paper provides a detailed account of the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ability for others to replicate the results fully. The authors mention that the model was trained on specific datasets, but more information on data preprocessing and augmentation techniques would enhance reproducibility.
One notable limitation is the reliance on a fixed number of pseudo-speakers, which may restrict the model's adaptability to diverse speaker characteristics in real-world applications. Additionally, while the model performs well under controlled conditions, its robustness in noisy or variable environments has not been thoroughly evaluated. Future work should also address the scalability of the system in terms of processing power and memory requirements, especially for deployment in resource-constrained settings.
The implications of this research are significant, particularly in the context of privacy-preserving technologies for voice communication. The ability to anonymize speaker identity while maintaining intelligibility and naturalness is crucial for applications in teleconferencing, live translation, and other real-time voice interfaces. As privacy concerns continue to grow, the development of effective voice conversion and anonymization systems like TVTSyn could play a vital role in enhancing user security and trust in voice technologies.
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
Primary: University of Maryland
All Institutions: University of Illinois Urbana-Champaign, University of Maryland, Adobe, Paris Smaragdis
The paper presents Gencho, a novel diffusion-transformer model for generating room impulse responses from reverberant speech, significantly advancing the state of the art in acoustic matching and generative audio applications. The comprehensive methodology and experimental validation underscore its potential impact on the field of audio processing.
The methodology presented in this paper is innovative, leveraging a diffusion-transformer architecture to generate room impulse responses (RIRs) from reverberant speech. The proposed structure-aware encoder effectively separates early and late reflections, which is a notable improvement over traditional methods that treat the input as a monolithic signal. This separation allows for more accurate modeling of the acoustic environment. The use of a diffusion-based decoder enhances the model's ability to generate diverse and perceptually realistic outputs, addressing the limitations of previous non-generative approaches. The integration of text conditioning for RIR generation further demonstrates the versatility of the proposed method.
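The early/late isolation the encoder relies on is conventionally defined by a fixed window after the direct-path arrival; the 50 ms threshold below is a common room-acoustics convention, not a value taken from the paper.

```python
import numpy as np

def split_early_late(rir: np.ndarray, sr: int, early_ms: float = 50.0):
    """Split a room impulse response into early reflections and late reverberation.

    The direct path is located at the peak of the impulse response, and everything
    within `early_ms` milliseconds after it is treated as 'early'.
    """
    direct = int(np.argmax(np.abs(rir)))
    cutoff = direct + int(sr * early_ms / 1000.0)
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:cutoff] = rir[:cutoff]
    late[cutoff:] = rir[cutoff:]
    return early, late
```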
The experiments are well-structured, utilizing a variety of datasets to evaluate the model's performance in different scenarios. The comparison with baseline models, particularly the regression-based FiNS variants, effectively highlights the advantages of the proposed Gencho model. The results indicate significant improvements in standard RIR metrics, showcasing the model's ability to generalize across unseen data. The hybrid approach combining the strengths of both generative and non-generative methods is a valuable addition to the experimental evaluation, demonstrating practical applications in real-world settings.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The authors could enhance reproducibility by providing access to their implementation and datasets used for training and evaluation.
One limitation of the proposed method is its reliance on high-quality input data for optimal performance. The model may struggle with noisy or poorly recorded reverberant speech, which could affect the accuracy of the generated RIRs. Additionally, while the text-to-RIR generation shows promise, the model's performance may vary based on the quality and specificity of the text prompts provided.
The implications of this research are significant for various applications in audio processing, including automated dialogue replacement, immersive audio experiences in AR/VR, and generative audio content creation. By enabling more flexible and realistic acoustic simulations, this work could enhance the quality of synthetic speech and audio in numerous contexts, ultimately contributing to advancements in the field of machine learning and audio technology.
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate in a flat Euclidean feature space that struggles to model the underlying circular topology of phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing the Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/RENet.
Primary: Institute of Acoustics, Chinese Academy of Sciences
All Institutions: Institute of Acoustics, Chinese Academy of Sciences, University of Chinese Academy of Sciences
The paper presents a significant advancement in speech enhancement through a novel phase modeling approach that respects the geometric properties of phase data. The methodology is innovative, and the results demonstrate substantial improvements over existing methods, marking a meaningful contribution to the field of audio processing and machine learning.
The paper introduces a novel manifold-aware framework for phase modeling in speech enhancement, emphasizing Global Rotation Equivariance (GRE) to address the circular topology of phase data. The methodology is well-structured, with two main components: the Magnitude-Phase Interactive Convolutional Module (MPICM) and the Hybrid-Attention Dual-FFN (HADF). These components facilitate effective interaction between magnitude and phase streams while preserving the intrinsic geometric properties of phase. The approach is innovative, as it fundamentally alters how phase information is processed in deep learning architectures, moving away from traditional Euclidean assumptions.
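The motivation for circular geometry is easy to demonstrate numerically: distances between phases should be measured on the circle, and a global phase rotation should leave them unchanged even when values wrap across the +/- pi boundary. The anti-wrapped distance below is a standard construction and only illustrates the property, not the paper's exact Phase Distance metric.

```python
import numpy as np

def phase_distance(p1: np.ndarray, p2: np.ndarray) -> float:
    """Anti-wrapped phase distance, measured on the circle rather than on the real line."""
    diff = np.angle(np.exp(1j * (p1 - p2)))       # wrap the difference into (-pi, pi]
    return float(np.mean(np.abs(diff)))

wrap = lambda x: np.angle(np.exp(1j * x))         # map phase back to its principal value

a, b = np.array([3.1]), np.array([3.0])
print(phase_distance(a, b), np.abs(a - b).item())            # 0.1 on the circle, 0.1 flat

# A global rotation by 0.1 rad pushes `a` across the +/- pi boundary: the flat Euclidean
# distance between principal values jumps to ~6.18, while the circular distance is
# unchanged -- the invariance a rotation-equivariant phase stream should respect.
a_rot, b_rot = wrap(a + 0.1), wrap(b + 0.1)
print(phase_distance(a_rot, b_rot), np.abs(a_rot - b_rot).item())   # 0.1 vs ~6.18
```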
The authors conduct extensive experiments across various tasks, including phase retrieval, denoising, dereverberation, and bandwidth extension. They use established datasets like VoiceBank+DEMAND and DNS Challenge 2020, demonstrating the effectiveness of their method against multiple strong baselines. The results indicate significant improvements in phase modeling accuracy and perceptual quality metrics, showcasing the robustness of the proposed architecture in diverse acoustic conditions. However, the paper could benefit from more detailed comparisons with a wider range of state-of-the-art methods.
The paper provides a clear description of the proposed architecture and the experimental setup, including datasets and training configurations. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work. However, specific hyperparameter settings and training details could be elaborated further to facilitate easier replication of results.
While the proposed method shows promising results, the paper does not address potential limitations such as the computational complexity of the model and its scalability to larger datasets. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other speech enhancement scenarios.
The proposed framework has significant implications for various applications in telecommunications, smart devices, and hearing aids, where effective speech enhancement is crucial. By improving phase modeling, the method could lead to advancements in real-time speech processing systems, enhancing user experience in noisy environments.
Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the input using a single sequence modeling module, making it both input-adaptive and parameter-efficient. We investigate two variants of SFC, one based on cross-attention and the other on Mamba, and introduce inductive biases inspired by the BS module to make them suitable for frequency information compression. Experiments on MSS and CASS tasks demonstrate that the SFC module consistently outperforms the BS module across different separator sizes and compression ratios. We also provide an analysis showing that SFC adaptively captures frequency patterns from the input.
Primary: National Institute of Advanced Industrial Science and Technology (AIST)
All Institutions: National Institute of Advanced Industrial Science and Technology (AIST), Waseda University
The main contribution of this paper is the introduction of the Spectral Feature Compression module, which provides a novel, input-adaptive, and parameter-efficient approach to spectral feature compression for source separation tasks. This work represents a meaningful advancement in the field of audio processing, addressing key limitations of existing methods and demonstrating strong empirical results.
The paper introduces a novel approach to spectral feature compression through the Spectral Feature Compression (SFC) module, which utilizes sequence modeling techniques to create an input-adaptive and parameter-efficient method for source separation. The methodology is well-structured, addressing the limitations of the traditional band-split (BS) module by incorporating inductive biases and demonstrating the effectiveness of two variants based on cross-attention and Mamba. The approach is innovative in its attempt to adaptively capture frequency patterns, which is a significant advancement over previous methods.
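The cross-attention variant can be sketched as a small set of learned query tokens attending over all frequency bins, so the grouping of bins becomes input-adaptive rather than fixed per subband. Dimensions, names, and the omission of the Mamba variant and the band-split-inspired biases are our simplifications.

```python
import torch
import torch.nn as nn

class CrossAttnFreqCompressor(nn.Module):
    """Compress F frequency bins to K tokens with learned queries and cross-attention.

    Unlike a band-split encoder with one module per predefined subband, a single
    attention module sees the whole spectrum, so the grouping of bins is input-adaptive.
    """
    def __init__(self, n_tokens: int, dim: int, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, freq_feats):
        # freq_feats: (batch, F, dim) features for each frequency bin at one time frame.
        q = self.queries.unsqueeze(0).expand(freq_feats.shape[0], -1, -1)
        compressed, _ = self.attn(q, freq_feats, freq_feats)   # (batch, K, dim)
        return compressed

x = torch.randn(2, 257, 64)                                    # 257 frequency bins, 64-dim features
print(CrossAttnFreqCompressor(n_tokens=32, dim=64)(x).shape)   # torch.Size([2, 32, 64])
```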
The experiments are comprehensive, evaluating the proposed SFC module against the BS module across various tasks, including music source separation (MSS) and cinematic audio source separation (CASS). The results consistently show that SFC outperforms BS across different separator sizes and compression ratios, indicating a robust experimental design. However, details on the datasets used and the specific metrics for evaluation could be elaborated further to enhance clarity.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the experimental setup. While the methodology is sound, the absence of a project URL or demo could hinder other researchers from replicating the results.
One limitation is the reliance on inductive biases inspired by the BS module, which may not generalize well to all types of audio signals. Additionally, while the SFC module shows promise, its performance in real-world scenarios beyond the tested datasets remains unverified.
The proposed method has significant implications for audio processing applications, particularly in enhancing the quality of source separation in music and cinematic audio. The input-adaptive nature of the SFC module could lead to more efficient and effective audio processing systems, potentially influencing both academic research and industry practices.
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
Primary: Tsinghua University
All Institutions: Tsinghua University, ByteDance
NarraScore represents a significant advancement in the synthesis of soundtracks for long-form videos by establishing a novel framework that connects visual narratives with musical dynamics through emotional control. The approach is innovative and addresses critical challenges in the field, although further validation and reproducibility efforts are needed to solidify its impact.
The methodology presented in NarraScore is innovative, leveraging frozen Vision-Language Models (VLMs) as affective sensors to convert visual narratives into Valence-Arousal trajectories. The Dual-Branch Injection strategy is particularly noteworthy, as it effectively balances global coherence with local dynamism, addressing the common pitfalls of dense attention mechanisms. The minimalist design is a strong point, as it aims to reduce overfitting risks associated with data scarcity, which is a prevalent issue in machine learning applications involving audio synthesis. However, the paper could benefit from a more detailed explanation of the training process and the specific architectures employed.
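To make the injection mechanism concrete, the following is a minimal sketch of how a token-level affective adapter with element-wise residual injection could be wired around a frozen generator; the module name, shapes, and zero-initialized gate are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a token-level adapter that injects a
# Valence-Arousal (V-A) control signal into a sequence of generator hidden states
# via element-wise residual addition. All shapes and module names are assumed.
import torch
import torch.nn as nn

class AffectiveAdapter(nn.Module):
    def __init__(self, d_model: int, va_dim: int = 2, hidden: int = 64):
        super().__init__()
        # Small bottleneck MLP maps the 2-D V-A trajectory to the token dimension.
        self.proj = nn.Sequential(
            nn.Linear(va_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )
        # Zero-initialized gate so the frozen backbone is unchanged at the start of training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, tokens: torch.Tensor, va_traj: torch.Tensor) -> torch.Tensor:
        # tokens:  (batch, seq_len, d_model) hidden states of the generator
        # va_traj: (batch, seq_len, 2) per-token valence/arousal values
        residual = self.proj(va_traj)              # (batch, seq_len, d_model)
        return tokens + self.gate * residual       # element-wise residual injection

# Toy usage
adapter = AffectiveAdapter(d_model=512)
tokens = torch.randn(1, 128, 512)
va = torch.rand(1, 128, 2) * 2 - 1  # valence/arousal in [-1, 1]
print(adapter(tokens, va).shape)    # torch.Size([1, 128, 512])
```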
The experiments demonstrate that NarraScore achieves state-of-the-art performance in terms of coherence and narrative alignment. The authors provide sufficient empirical evidence to support their claims, although the paper lacks a comprehensive comparison with existing methods beyond a cursory mention. The results are promising, but the absence of detailed metrics or benchmarks makes it difficult to fully gauge the significance of the improvements claimed. Future work should include a broader set of comparisons to further validate the approach.
The paper does not provide sufficient details regarding the implementation, datasets, or training procedures, which raises concerns about reproducibility. Clearer guidelines and access to code or data would significantly enhance the ability of other researchers to replicate the findings. The lack of a demo or project URL further complicates this aspect.
The authors acknowledge limitations related to the temporal granularity of affective control, which could hinder synchronization with rapid visual events. Additionally, the cascaded design may lead to error propagation, which could affect the overall performance. Addressing these limitations in future work will be crucial for improving the robustness of the framework.
The potential applications of NarraScore are significant, particularly in the fields of film, gaming, and content creation where automated soundtrack generation could enhance user experience. The ability to generate music that aligns with narrative emotion could also open new avenues for interactive media. However, ethical considerations regarding the use of AI-generated content and its implications for creative industries should be discussed further.
Sound source tracking is commonly performed using classical array-processing algorithms, while machine-learning approaches typically rely on precise source position labels that are expensive or impractical to obtain. This paper introduces a physics-guided variational model capable of fully unsupervised single-source sound source tracking. The method combines a variational encoder with a physics-based decoder that injects geometric constraints into the latent space through analytically derived pairwise time-delay likelihoods. Without requiring ground-truth labels, the model learns to estimate source directions directly from microphone array signals. Experiments on real-world data demonstrate that the proposed approach outperforms traditional baselines and achieves accuracy and computational complexity comparable to state-of-the-art supervised models. We further show that the method generalizes well to mismatched array geometries and exhibits strong robustness to corrupted microphone position metadata. Finally, we outline a natural extension of the approach to multi-source tracking and present the theoretical modifications required to support it.
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, NXP Semiconductors
The paper presents a novel physics-guided variational model for unsupervised sound source tracking, effectively bridging machine learning and traditional signal processing techniques. The methodology is innovative, and the results demonstrate significant potential for real-world applications, marking a meaningful contribution to the field of audio processing.
The proposed methodology integrates a variational autoencoder with a physics-based decoder, effectively combining machine learning with established physical principles to enhance sound source tracking. This innovative approach allows for unsupervised learning by leveraging geometric constraints without requiring labeled data, which is a significant advancement over traditional methods. The use of a von Mises-Fisher distribution for directional statistics is particularly noteworthy, as it is well-suited for the problem at hand. The architecture is designed for efficiency, allowing for parallel processing, which is crucial for real-time applications.
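As an illustration of the geometric constraint involved, the sketch below computes the pairwise time delays implied by a candidate source direction under a far-field (plane-wave) model; the array layout and speed of sound are assumptions for the example, and the paper's analytically derived likelihoods are not reproduced here.

```python
# Illustrative sketch (not the paper's implementation): far-field pairwise time
# delays implied by a candidate source direction, the kind of geometric
# constraint a physics-based decoder can inject into the latent space.
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def pairwise_delays(direction: np.ndarray, mic_pos: np.ndarray) -> np.ndarray:
    """TDOA (seconds) for every microphone pair under a plane-wave model.

    direction: (3,) vector pointing from the array toward the source.
    mic_pos:   (M, 3) microphone coordinates in metres.
    """
    direction = direction / np.linalg.norm(direction)
    # The signal arrives earlier at microphones further along the direction vector.
    arrival = -mic_pos @ direction / C          # (M,) relative arrival times
    return arrival[:, None] - arrival[None, :]  # (M, M) pairwise TDOAs

# Toy usage: a 4-mic square array, source toward +x.
mics = np.array([[0.05, 0.05, 0.0], [0.05, -0.05, 0.0],
                 [-0.05, 0.05, 0.0], [-0.05, -0.05, 0.0]])
tau = pairwise_delays(np.array([1.0, 0.0, 0.0]), mics)
print(np.round(tau * 1e6, 1))  # pairwise delays in microseconds
```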
The experiments conducted are robust, comparing the proposed method against classical and state-of-the-art supervised models across various scenarios, including noise and uncertainty in microphone positioning. The results demonstrate that the proposed model not only outperforms classical methods but also competes well with supervised approaches, showcasing its effectiveness in real-world applications. The use of real-world data for testing adds credibility to the findings.
The paper provides a detailed description of the methodology, including the architecture and loss functions, which facilitates reproducibility. However, the lack of a publicly available code repository or demo limits the ease with which other researchers can replicate the results.
One limitation is the focus on single-source tracking, which may restrict the applicability of the method in more complex environments with multiple sound sources. Additionally, while the model performs well under various conditions, its performance relative to supervised models in all scenarios may not be consistent, particularly in highly reverberant environments.
This research has the potential to significantly impact fields such as robotics, hearing aids, and surveillance systems, where accurate sound source localization is critical. The unsupervised nature of the model could lead to more accessible implementations in devices that cannot afford extensive labeled training data.
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University
The paper presents ProtoDisent-TTS, a prototype-based disentanglement TTS framework that effectively synthesizes dysarthric speech while preserving speaker identity. The innovative methodology and promising experimental results position this work as a valuable contribution to the field of speech synthesis and assistive technologies.
The proposed ProtoDisent-TTS framework introduces a novel approach to disentangling speaker identity from dysarthric articulation by utilizing a prototype-based codebook and a dual-classifier objective. This method is innovative as it combines elements of text-to-speech synthesis with a clear focus on pathology, allowing for controlled speech generation. The use of a gradient reversal layer to enforce invariance of speaker embeddings to dysarthric attributes is particularly noteworthy, as it addresses a significant challenge in the field of speech synthesis.
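Because the gradient reversal layer is the linchpin of the disentanglement objective, a minimal PyTorch sketch of this standard component is shown below; it is the textbook formulation, not the authors' exact code.

```python
# Minimal sketch of a gradient reversal layer (a standard building block, not the
# authors' exact implementation). In the forward pass it is the identity; in the
# backward pass it flips the gradient's sign, so a speaker encoder trained through
# it is pushed to become invariant to the pathology classifier's signal.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Toy usage: gradients reaching `emb` are negated.
emb = torch.randn(4, 256, requires_grad=True)
grad_reverse(emb).sum().backward()
print(emb.grad[0, :3])  # all -1.0 because of the sign flip
```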
The experiments conducted on the TORGO dataset are well-structured and demonstrate the effectiveness of the proposed framework. The results show consistent improvements in ASR performance and speaker identity preservation, validating the utility of synthetic data generated by ProtoDisent-TTS. However, the paper could benefit from more extensive comparisons with existing state-of-the-art methods to better contextualize the results.
The implementation details provided are thorough, including specifics on the architecture, training procedures, and hyperparameters. However, the absence of a publicly accessible code repository limits the reproducibility of the results. The authors mention using a pre-trained Index-TTS model, but it would be beneficial to provide access to this model or detailed instructions for replication.
One limitation of the study is the reliance on a relatively small dataset (TORGO), which may affect the generalizability of the findings. Additionally, while the framework shows promise for dysarthric speech synthesis, its performance in real-world applications and with diverse speaker populations remains to be evaluated.
The work has significant implications for assistive speech technologies, particularly for individuals with dysarthria. By enabling controllable and interpretable speech synthesis, the framework could enhance communication for those affected by speech disorders. This research could also inspire further studies in related areas, such as voice conversion and personalized speech synthesis.
While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learning via Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti excels in precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.
Primary: Wuhan University of Technology
All Institutions: Wuhan University of Technology, Tencent Inc.
The paper presents Tutti, a novel framework for dynamic multi-singer synthesis that significantly enhances the acoustic realism and artistic cohesion of choral generation. The innovative methodology and comprehensive experimental validation position this work as a meaningful contribution to the field of machine learning and audio synthesis.
The methodology presented in this paper is robust and innovative, introducing the Tutti framework for multi-singer synthesis. The Structure-Aware Singer Prompt and the Complementary Texture Learning via Condition-Guided VAE are significant contributions that address the limitations of existing Singing Voice Synthesis (SVS) systems. The integration of these components allows for dynamic scheduling of singers and captures complex vocal textures, which are crucial for realistic multi-singer arrangements. The use of a Latent Diffusion Transformer (DiT) backbone enhances the model's ability to manage long musical sequences effectively.
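For readers unfamiliar with the conditioning idea, the following is a hedged sketch of a condition-guided VAE in which a texture latent is encoded and decoded alongside explicit control embeddings, so the latent only has to carry what the explicit controls do not; all module names and dimensions are assumptions rather than Tutti's architecture.

```python
# Hedged sketch (module names and dimensions are assumptions): a condition-guided
# VAE that encodes acoustic texture into a latent z while conditioning both the
# encoder and decoder on explicit controls (e.g., singer/structure embeddings).
import torch
import torch.nn as nn

class ConditionGuidedVAE(nn.Module):
    def __init__(self, feat_dim=128, cond_dim=64, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + cond_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, feats, cond):
        mu, logvar = self.encoder(torch.cat([feats, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, kl

vae = ConditionGuidedVAE()
recon, kl = vae(torch.randn(8, 128), torch.randn(8, 64))
print(recon.shape, float(kl))
```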
The experimental setup is comprehensive, utilizing a large dataset for training and rigorous evaluation metrics, including both objective and subjective assessments. The results demonstrate significant improvements in multi-singer scheduling and acoustic realism compared to existing models. The ablation studies effectively highlight the contributions of each component of the proposed framework, reinforcing the importance of the adaptive fuser and texture learning in achieving high-quality synthesis.
The paper provides detailed implementation and training configurations, including model architecture, training parameters, and evaluation protocols. This level of detail supports reproducibility, allowing other researchers to replicate the experiments. However, the lack of a publicly available code repository limits accessibility for broader validation and experimentation.
The paper acknowledges limitations, such as the assumption that verse sections contain only a single singer, which may not reflect real-world scenarios. Additionally, the model's performance in melodicity and emotional expressiveness is noted as an area for improvement. These limitations suggest that while the framework is innovative, it may require further refinement to handle more complex musical arrangements.
The Tutti framework has the potential to significantly impact the field of music generation and synthesis, particularly in applications involving choral music and multi-singer arrangements. By enhancing the realism and expressiveness of synthesized singing voices, this research could facilitate advancements in music production, virtual performances, and interactive music applications. The implications extend to creative industries, education, and entertainment, where realistic vocal synthesis can enhance user experiences.
Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored using audio-text joint embeddings, allowing users to enroll phrases with text, and proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often overly bias the beginning phonemes of an enrollment, causing false triggers when negative enrollment-query pairs share a prefix ("turn the volume up" vs. "turn the volume down"). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB) with two datasets, POB-Spark and POB-LibriPhrase (POB-LP), containing mismatched audio-text pairs with shared prefixes, and propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. Using EPS alone reduces EER on POB-Spark from 64.4% to 29.3% and improves POB-LP accuracy from 87.6% to 96.8%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added in training, our work achieves the best POB benchmark results while incurring the least amount of degradation on prior metrics among baselines. This degradation is most pronounced in GSC, which contains only one-word commands. We surface mitigating this trade-off as future work.
Primary: University of California San Diego
All Institutions: University of California San Diego, Bose Corporation
This paper makes a meaningful contribution to the field of machine learning by addressing a critical challenge in open-vocabulary keyword spotting and proposing effective solutions. The combination of innovative methodology and practical applications positions this work as a valuable reference for future research in audio processing and multimodal representation learning.
The paper introduces a novel approach to mitigating prefix bias in open-vocabulary keyword spotting (OV-KWS) through the development of the Equal-weighting Position Scoring (EPS) module and the Partial Overlap Benchmark (POB). The methodology is sound, as it identifies and addresses specific shortcomings in existing OV-KWS systems, particularly in handling phrases with shared prefixes. The creation of two datasets (POB-Spark and POB-LibriPhrase) is a significant contribution, providing a basis for evaluating the performance of OV-KWS under more realistic conditions. The EPS module's design is lightweight and interpretable, which is beneficial for deployment in edge devices.
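The paper's exact EPS formulation is not reproduced here, but the sketch below illustrates the underlying idea under simplifying assumptions (aligned, equal-length audio and text embeddings): every position receives an equal vote, so a mismatch at the end of "turn the volume down" counts as much as the shared prefix.

```python
# Hedged sketch of the general idea behind equal-weighting position scoring
# (not the paper's exact formulation): give every aligned position of the
# enrollment an equal vote in the final audio-text match score, so shared
# prefixes cannot dominate the decision.
import numpy as np

def position_scores(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity per aligned position. Both inputs are (T, D)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return (a * t).sum(axis=-1)                 # (T,)

def eps_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    # Uniform weights over positions: late-position mismatches
    # ("...up" vs "...down") count as much as the shared prefix.
    return float(position_scores(audio_emb, text_emb).mean())

# Toy usage with random embeddings standing in for model outputs.
rng = np.random.default_rng(0)
audio, text = rng.normal(size=(10, 64)), rng.normal(size=(10, 64))
print(eps_score(audio, text))
```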
The experiments are well-structured, comparing the proposed methods against established baselines (SLiCK and PhonMatchNet) under various training conditions. The results demonstrate a clear improvement in performance metrics, particularly in reducing the equal error rate (EER) on the POB datasets. The paper effectively highlights the trade-offs involved when incorporating the POB data during training, providing a nuanced understanding of the model's performance across different scenarios.
The paper provides sufficient implementation details, including model architectures, training procedures, and dataset descriptions, which facilitate reproducibility. The authors mention using a specific framework (PyTorch) and provide links to their datasets, which is a positive aspect for researchers looking to replicate or build upon their work.
One limitation noted in the paper is the performance degradation on single-word commands when using POB data for training. This suggests that while the proposed methods improve robustness for longer phrases, they may inadvertently compromise performance on shorter commands. Additionally, the paper does not explore the potential for more complex scoring mechanisms that could further mitigate prefix bias without introducing new biases.
The findings have significant implications for the development of more robust voice-controlled systems, particularly in consumer electronics and smart devices. By improving the accuracy of OV-KWS, this research could enhance user experience and broaden the applicability of voice command technologies in various domains, including accessibility, gaming, and home automation.
Current audio formats present a fundamental trade-off between file size and functionality: lossless formats like FLAC preserve quality but lack adaptability, while lossy formats reduce size at the cost of fidelity and offer no stem-level access. We introduce the Stem-Native Codec (SNC), a novel audio container format that stores music as independently encoded stems plus a low-energy mastering residual. By exploiting the lower information entropy of separated stems compared to mixed audio, SNC achieves a 38.2% file size reduction versus FLAC (7.76 MB vs. 12.55 MB for a 2:18 test track) while maintaining perceptual transparency (STOI = 0.996). Unlike existing formats, SNC enables context-aware adaptive playback, spatial audio rendering, and user-controlled remixing without requiring additional storage. Our experimental validation demonstrates that the stems-plus-residual architecture successfully decouples the conflicting requirements of compression efficiency and feature richness, offering a practical path toward next-generation audio distribution systems.
Primary: Wubble AI
All Institutions: Wubble AI
The main contribution of this paper is the introduction of the Stem-Native Codec (SNC), which innovatively combines efficient lossless audio storage with adaptive playback capabilities. This work presents a significant advancement in audio compression technology, addressing key limitations of existing formats and paving the way for future developments in audio distribution systems.
The methodology is well-structured, introducing the Stem-Native Codec (SNC) as a novel approach to audio storage that separates audio into independently encoded stems and a mastering residual. The theoretical framework is grounded in information theory, establishing a strong basis for the claim that separated stems have lower information entropy than mixed audio. The choice of using Opus for encoding stems is justified, and the detailed description of the encoding and decoding processes demonstrates a comprehensive understanding of audio compression techniques. However, the paper could benefit from clearer references to the sections mentioned in the contributions, as they are currently marked as [REF].
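A toy illustration of the stems-plus-residual decomposition follows; a crude quantization step stands in for the Opus round-trip, so none of the real codec's API is shown and the numbers are purely illustrative.

```python
# Toy sketch of the stems-plus-residual idea (the real format uses Opus-coded
# stems; a placeholder round-trip stands in for any lossy codec here).
import numpy as np

def lossy_roundtrip(x: np.ndarray) -> np.ndarray:
    # Placeholder for an encode/decode cycle; crude quantisation mimics codec loss.
    return np.round(x * 2**12) / 2**12

def encode_snc(stems: list, master: np.ndarray):
    decoded = [lossy_roundtrip(s) for s in stems]      # what the decoder will see
    residual = master - np.sum(decoded, axis=0)        # low-energy mastering residual
    return decoded, residual

def decode_snc(decoded_stems: list, residual: np.ndarray) -> np.ndarray:
    # Adding the residual back recovers the original master mix.
    return np.sum(decoded_stems, axis=0) + residual

# Toy usage: three random "stems" summed into a master mix.
rng = np.random.default_rng(1)
stems = [rng.normal(scale=0.1, size=48000) for _ in range(3)]
master = np.sum(stems, axis=0)
dec, res = encode_snc(stems, master)
print(np.max(np.abs(decode_snc(dec, res) - master)))   # ~0: reconstruction is exact
```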
The experimental validation is robust, showcasing a significant file size reduction of 38.2% compared to FLAC while maintaining high perceptual quality (STOI = 0.996). The use of objective metrics such as spectral convergence and SNR adds credibility to the results. The paper effectively compares SNC with existing formats and highlights its advantages in terms of adaptive playback and spatial audio rendering. However, the experiments rely on a single test track, which may limit the generalizability of the findings.
The paper provides open-source encoder and decoder implementations, which is a strong point for reproducibility. The detailed encoding parameters and procedures are well-documented, allowing for potential replication of the results. However, the lack of a demo or project URL limits accessibility for interested researchers.
The primary limitation identified is the dependency on high-quality stems for effective encoding. The paper acknowledges that AI separation methods may introduce artifacts, which could affect the performance of SNC. Additionally, the decoding complexity is slightly higher than traditional formats, which may pose challenges for some applications. The need for standardized metadata schemas for adaptive playback features is also a potential barrier to widespread adoption.
The SNC has the potential to significantly influence music distribution by enabling smaller file sizes and enhanced playback experiences tailored to diverse environments. It opens up new avenues for artists to engage with their audience through remixing capabilities and adaptive features. The proposed format could also lead to reduced storage and bandwidth costs for streaming platforms, making advanced audio formats more accessible.
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
Primary: Soul-AI Lab
All Institutions: Soul-AI Lab
SoulX-Singer represents a significant advancement in zero-shot singing voice synthesis, combining a large-scale dataset with innovative modeling techniques to achieve high-quality, flexible vocal generation across multiple languages. The comprehensive evaluation and robust methodology position this work as a valuable contribution to the field of machine learning and audio synthesis.
The methodology of SoulX-Singer is robust, leveraging a large-scale dataset of over 42,000 hours of vocal recordings to enhance zero-shot generalization capabilities. The dual-control mechanism (melody-control and score-control modes) is innovative, allowing for flexible synthesis based on different input types. The data processing pipeline is well-structured, ensuring high-quality vocal extraction and annotation, which is crucial for training effective models. The use of flow matching and a dedicated Singing Content Encoder to manage multimodal inputs is a significant advancement in the field.
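For context on the generative objective, the following is a hedged sketch of a standard conditional flow-matching loss with a linear interpolation path; the velocity network and conditioning tensors are placeholders, not SoulX-Singer's architecture.

```python
# Hedged sketch of a standard conditional flow-matching objective (linear path),
# the kind of training loss a flow-matching generator refers to; the network and
# conditioning are placeholders, not the report's actual model.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim=80, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 512), nn.SiLU(),
                                 nn.Linear(512, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: clean target features (e.g., mel frames); cond: content/singer conditioning."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                             # constant velocity along the path
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(4, 80), torch.randn(4, 256))
print(float(loss))
```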
The experimental evaluation is thorough, utilizing two distinct benchmarks (GMO-SVS and SoulX-Singer-Eval) to assess performance across multiple dimensions, including melodic accuracy, intelligibility, and overall singing quality. The results consistently demonstrate that SoulX-Singer outperforms existing state-of-the-art models, showcasing its effectiveness in both controlled and zero-shot scenarios. The comprehensive metrics used for evaluation provide a clear picture of the model's capabilities.
The paper provides sufficient detail regarding the architecture, training process, and evaluation metrics, which supports reproducibility. The availability of the dataset and code on GitHub further enhances the potential for other researchers to replicate the study. However, the reliance on specific pretrained models for vocal extraction and transcription may pose some challenges in reproducing the exact results without access to those models.
One limitation of the study is the potential for voice impersonation and ethical concerns associated with the use of synthesized voices, which the authors acknowledge. Additionally, while the model shows strong performance across multiple languages, the dataset's composition may still limit its generalization to other languages or dialects not represented in the training data.
SoulX-Singer has significant implications for the music production industry, enabling creators to synthesize high-quality singing voices without the need for extensive vocal recordings. This technology could democratize music creation, allowing individuals without access to professional singers to produce high-quality vocal tracks. However, the ethical considerations surrounding voice synthesis and potential misuse must be addressed to ensure responsible deployment.
Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few-shot audio and audiovisual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches by absolute margins of up to 14.52%, 1.53%, and 8.35% for audio classification, audio-visual classification, and spoofing detection, respectively.
Primary: MIT-IBM Watson AI Lab
All Institutions: MIT-IBM Watson AI Lab, Tuebingen AI Center
The main contribution of this paper is the introduction of a class-dependent weighting mechanism for attention heads in large audio-language models, which significantly enhances their performance in few-shot classification tasks. This work represents a meaningful advancement in the field of audio processing and machine learning, addressing existing limitations in model performance and paving the way for future research in adaptive attention mechanisms.
The proposed method, Class-Conditional Sparse Attention Vectors, introduces a novel approach to weighting attention heads based on class-specific importance, which is a significant departure from previous methods that treated all heads equally. This class-dependent weighting mechanism allows for more nuanced feature extraction tailored to specific tasks, enhancing the model's performance in few-shot classification scenarios. The methodology is well-structured and builds upon existing frameworks in audio-language processing, demonstrating a clear understanding of the limitations of uniform voting schemes.
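To ground the idea, the sketch below shows one plausible (assumed, not the paper's) way to realize class-conditional head weighting: estimate each head's per-class reliability from nearest-prototype accuracy on the few-shot support set, then let heads vote in proportion to that reliability.

```python
# Hedged sketch of class-conditional head weighting (not the paper's exact
# estimator): score each attention head per class by nearest-prototype accuracy
# on the few-shot support set, then use those scores as per-class voting weights.
import numpy as np

def head_class_weights(support_feats: np.ndarray, support_labels: np.ndarray,
                       n_classes: int):
    """support_feats: (N, H, D) per-head features for N support clips."""
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in range(n_classes)])             # (C, H, D)
    # Per-head nearest-prototype predictions for every support example.
    dists = np.linalg.norm(support_feats[:, None] - protos[None], axis=-1)  # (N, C, H)
    preds = dists.argmin(axis=1)                                # (N, H)
    # Reliability of head h for class c = accuracy on class-c support examples.
    weights = np.zeros((n_classes, support_feats.shape[1]))
    for c in range(n_classes):
        weights[c] = (preds[support_labels == c] == c).mean(axis=0)
    return protos, weights

def classify(query_feats: np.ndarray, protos: np.ndarray, weights: np.ndarray) -> int:
    """query_feats: (H, D). Heads vote for their nearest class, weighted by reliability."""
    dists = np.linalg.norm(query_feats[None] - protos, axis=-1)   # (C, H)
    votes = (dists == dists.min(axis=0, keepdims=True))           # one vote per head
    return int((votes * weights).sum(axis=1).argmax())

# Toy usage with random features standing in for attention-head outputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8, 16))
labels = np.repeat(np.arange(4), 5)
protos, w = head_class_weights(feats, labels, n_classes=4)
print(classify(rng.normal(size=(8, 16)), protos, w))
```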
The experiments conducted across various benchmarks for audio classification, audio-visual classification, and spoofing detection are robust. The reported improvements over state-of-the-art methods by notable margins (up to 14.52% in audio classification) indicate that the proposed method is not only effective but also competitive in real-world applications. However, the paper would benefit from a more detailed description of the datasets used and the specific metrics for evaluation to enhance transparency.
The paper lacks sufficient implementation details and code availability, which are critical for reproducibility. While the methodology is sound, without access to the code or a clear description of the experimental setup, it would be challenging for other researchers to replicate the results.
One limitation is the reliance on few-shot learning, which may not generalize well to all audio classification tasks, particularly those requiring extensive training data. Additionally, the paper does not address potential biases in the attention heads or the implications of class imbalance in the datasets used.
The implications of this research are significant for the development of more efficient audio-language models that can be applied in various domains, including accessibility technologies, automated content moderation, and interactive AI systems. By improving the performance of LALMs in discriminative tasks, this work could enhance user experiences in applications such as voice assistants and audio-based search engines.
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
Primary: Meta
All Institutions: Meta
The main contribution of this paper is the introduction of SiTok, a novel speech tokenizer that utilizes a diffusion autoencoder to achieve high-quality speech representation and reconstruction while maintaining low bit and token rates. This work significantly advances the field of speech processing by addressing key challenges in existing methodologies and providing a robust framework for future research and applications.
The proposed methodology of the Speech Diffusion Tokenizer (SiTok) is innovative, leveraging a diffusion autoencoder to jointly optimize quantization and reconstruction. The introduction of semantic regularization through a CTC decoder is a significant advancement, allowing the model to maintain semantic integrity while achieving high compression rates. The architecture effectively combines the strengths of diffusion models with the need for efficient speech tokenization, addressing the limitations of previous approaches that often relied on heuristic compromises. The design choices, such as the use of mel-spectrograms and the focus on low token rates, are well-justified and align with the objectives of scalable language modeling.
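As a rough illustration of the joint objective, the snippet below combines a reconstruction-style diffusion term with a CTC regularizer on the token stream; the loss weights, shapes, and the toy diffusion term are assumptions, not SiTok's training code.

```python
# Hedged illustration (weights, shapes, and the toy "diffusion" term are assumed):
# combining a reconstruction/diffusion objective on the decoder side with a
# CTC-based semantic regularizer on the quantized token stream.
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_loss(pred_noise, true_noise, token_logits, transcripts, transcript_lens,
               ctc_weight: float = 0.5):
    # Diffusion-style regression term on the decoder side.
    diff_loss = F.mse_loss(pred_noise, true_noise)
    # CTC term: token_logits is (T, N, vocab); transcripts are padded label ids.
    log_probs = token_logits.log_softmax(dim=-1)
    input_lens = torch.full((token_logits.shape[1],), token_logits.shape[0],
                            dtype=torch.long)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs, transcripts, input_lens, transcript_lens)
    return diff_loss + ctc_weight * ctc

# Toy usage with random tensors standing in for model outputs.
T, N, vocab = 50, 2, 32
loss = joint_loss(torch.randn(N, 80), torch.randn(N, 80),
                  torch.randn(T, N, vocab),
                  torch.randint(1, vocab, (N, 12)),
                  torch.tensor([12, 10]))
print(float(loss))
```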
The experiments conducted are extensive, utilizing a large dataset of 2 million hours of speech, which enhances the robustness of the findings. The paper provides a comprehensive evaluation across various tasks, including speech reconstruction, emotion recognition, and automatic speech recognition, demonstrating that SiTok outperforms existing baselines significantly. The results are well-presented, with clear metrics for comparison, and the ablation studies effectively highlight the contributions of different components of the model.
The paper includes detailed descriptions of the model architecture, training settings, and evaluation protocols, which are crucial for reproducibility. The authors have made efforts to ensure that their work can be replicated, which is commendable. However, the absence of a publicly available code repository limits the ease of reproducibility for practitioners in the field.
While the proposed model shows promising results, it may still face challenges in real-world applications, such as the potential for overfitting due to the large number of parameters (1.6B) and the reliance on extensive training data. Additionally, the computational efficiency during inference, although improved with shortcut fine-tuning, may still be a concern for deployment in resource-constrained environments. The paper does not address the ethical implications of misuse in generating synthetic speech, which is an important consideration in today's landscape.
The development of SiTok has significant implications for speech technology, particularly in applications such as automatic speech recognition, text-to-speech systems, and conversational agents. By enabling high-fidelity audio reconstruction at low bit rates, this work could enhance accessibility and usability in various domains, including assistive technologies and real-time communication systems. The potential for misuse, such as generating deceptive synthetic speech, highlights the need for responsible deployment and monitoring of such technologies.
Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the dynamic nature and acoustic complexity of 360-degree scenes, fail to fully account for dynamic sound sources, and neglect complex environmental effects such as occlusion, reflections, and reverberation, which are influenced by scene geometries and materials. We propose DynFOA, a framework based on dynamic acoustic perception and conditional diffusion, for generating high-fidelity FOA from 360-degree videos. DynFOA first performs visual processing via a video encoder, which detects and localizes multiple dynamic sound sources, estimates their depth and semantics, and reconstructs the scene geometry and materials using a 3D Gaussian Splatting. This reconstruction technique accurately models occlusion, reflections, and reverberation based on the geometries and materials of the reconstructed 3D scene and the listener's viewpoint. The audio encoder then captures the spatial motion and temporal 4D sound source trajectories to fine-tune the diffusion-based FOA generator. The fine-tuned FOA generator adjusts spatial cues in real time, ensuring consistent directional fidelity during listener head rotation and complex environmental changes. Extensive evaluations demonstrate that DynFOA consistently outperforms existing methods across metrics such as spatial accuracy, acoustic fidelity, and distribution matching, while also improving the user experience. Therefore, DynFOA provides a robust and scalable approach to rendering realistic dynamic spatial audio for VR and immersive media applications.
Primary: Martha Stewart Enterprises
All Institutions: Martha Stewart Enterprises, Allied Widgets Research
DynFOA presents a significant advancement in the generation of spatial audio for complex acoustic environments. The integration of visual and acoustic processing through a conditional diffusion model marks a notable contribution to the field, addressing critical challenges in immersive audio rendering.
The methodology presented in DynFOA is robust, integrating a multi-modal approach that combines visual processing with audio generation through conditional diffusion. The use of 3D Gaussian Splatting for scene reconstruction is particularly innovative, allowing for a detailed understanding of the environment that enhances acoustic fidelity. The model's architecture, which includes separate encoders for video and audio, effectively captures the complexities of dynamic sound sources in 360-degree videos. However, the reliance on specific datasets and the complexity of the model may limit its applicability in diverse real-world scenarios.
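For orientation, the sketch below shows textbook first-order ambisonics panning of a mono source (AmbiX ACN/SN3D channel order), i.e., the output format DynFOA targets; it is not the learned generator itself.

```python
# Hedged sketch, for orientation only: encoding a mono source at a given azimuth/
# elevation into first-order ambisonics (AmbiX ACN/SN3D channel order W, Y, Z, X).
# This is textbook FOA panning, not DynFOA's learned generator.
import numpy as np

def encode_foa(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    w = mono                                   # omnidirectional component
    y = mono * np.sin(az) * np.cos(el)         # left-right
    z = mono * np.sin(el)                      # up-down
    x = mono * np.cos(az) * np.cos(el)         # front-back
    return np.stack([w, y, z, x])              # (4, num_samples)

# Toy usage: 1 kHz tone placed 45 degrees to the left, slightly elevated.
sr = 48000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth_deg=45, elevation_deg=10)
print(foa.shape)  # (4, 48000)
```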
The experimental evaluation is comprehensive, utilizing a well-structured dataset (Dyn360) that includes various acoustic scenarios. The results demonstrate a clear superiority of DynFOA over baseline methods across multiple metrics, including spatial accuracy and acoustic fidelity. The inclusion of both objective metrics and user studies strengthens the findings, providing a balanced view of the model's performance. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the training process. While the methodology is described in depth, the absence of a public repository or demo limits the ability of other researchers to replicate the results.
Key limitations include the model's performance in uncontrolled environments, as the experiments were primarily conducted in indoor settings. Additionally, the approach may not generalize well to different acoustic conditions, such as underwater environments or those with varying material properties. The reliance on specific datasets could also introduce biases that affect the generalizability of the findings.
The potential applications of DynFOA are significant, particularly in the fields of virtual reality, augmented reality, and immersive media. By improving the realism of spatial audio, this work can enhance user experiences in gaming, film, and educational applications. The integration of visual and acoustic modalities could pave the way for more immersive storytelling and interactive experiences.
Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses of a given scene into a set of scalar acoustic parameters, which can reach unmanageable sizes in large environments with many source-receiver pairs. We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting these acoustic parameters. The RLF framework employs a volumetric grid of trainable latent embeddings decoded with a symmetric function, ensuring acoustic reciprocity. We study a variety of decoders and show that leveraging Riemannian metric learning leads to a better reproduction of acoustic phenomena in complex scenes. Experimental validation demonstrates that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude. Furthermore, a MUSHRA-like subjective listening test indicates that sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.
Primary: unknown
All Institutions: unknown
The paper presents a novel framework for modeling sound propagation using latent embeddings, significantly improving memory efficiency and maintaining perceptual quality in audio rendering. The technical contributions, particularly the integration of Riemannian metric learning, position this work as a meaningful advancement in the field of audio machine learning, with practical applications in immersive environments.
The paper introduces the Reciprocal Latent Fields (RLF) framework, which innovatively utilizes a volumetric grid of trainable latent embeddings to encode and predict acoustic parameters. The methodology emphasizes the importance of acoustic reciprocity by employing symmetric functions in the decoding process. The use of Riemannian metric learning to enhance the accuracy of acoustic phenomena reproduction is a notable advancement over simpler Euclidean models. The approach is well-structured, with clear definitions and justifications for the chosen methods, including the training process and the architecture of the decoders.
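The reciprocity constraint can be made concrete with a small sketch: if the decoder only sees symmetric combinations of the source and receiver latents, swapping the endpoints cannot change its output. The module below is an assumed illustration, not the paper's decoder.

```python
# Hedged sketch (not the paper's decoder): enforcing acoustic reciprocity by
# feeding the decoder only symmetric combinations of the source and receiver
# latents, so swapping the two endpoints cannot change the predicted parameters.
import torch
import torch.nn as nn

class SymmetricDecoder(nn.Module):
    def __init__(self, z_dim=16, n_params=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_params))

    def forward(self, z_src, z_rcv):
        # Sum and element-wise product are invariant to swapping src and rcv.
        sym = torch.cat([z_src + z_rcv, z_src * z_rcv], dim=-1)
        return self.mlp(sym)

dec = SymmetricDecoder()
a, b = torch.randn(1, 16), torch.randn(1, 16)
print(torch.allclose(dec(a, b), dec(b, a)))  # True: reciprocity holds by construction
```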
The experimental validation is robust, featuring a variety of models and configurations tested across two distinct environments (Audio Gym and Wwise Audio Lab). The results demonstrate significant memory efficiency gains while maintaining high fidelity in sound reproduction, as evidenced by both quantitative metrics and qualitative assessments through MUSHRA-like listening tests. The paper provides a thorough analysis of the performance of different models, comparing their accuracy and computational costs effectively.
While the paper details the methodology and experimental setup comprehensively, it lacks explicit URLs for code or data repositories, which could hinder reproducibility. The description of the training data generation and model training processes is clear, but without access to the actual implementation, independent verification of results may be challenging.
The primary limitations identified include the lack of implementation for spatial compression of the latent fields and the restriction to static geometries, which limits the applicability of the RLF framework in dynamic environments. The authors acknowledge these limitations and suggest future work to address them, indicating an awareness of the framework's current constraints.
The RLF framework has significant implications for real-time audio rendering in virtual environments, particularly in gaming and simulation contexts. By reducing memory requirements while maintaining high-quality sound reproduction, this work could enhance user experiences in immersive environments. The potential for extending the framework to other reciprocal quantities also opens avenues for further research and applications beyond acoustics.
AI music generators have advanced to the point where their outputs are often indistinguishable from human compositions. While detection methods have emerged, they are typically designed and validated in music streaming contexts with clean, full-length tracks. Broadcast audio, however, poses a different challenge: music appears as short excerpts, often masked by dominant speech, conditions under which existing detectors fail. In this work, we introduce AI-OpenBMAT, the first dataset tailored to broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. We benchmark a CNN baseline and state-of-the-art SpectTTTra models to assess SNR and duration robustness, and evaluate on a full broadcast scenario. Across all settings, models that excel in streaming scenarios suffer substantial degradation, with F1-scores dropping below 60% when music is in the background or has a short duration. These results highlight speech masking and short music length as critical open challenges for AI music detection, and position AI-OpenBMAT as a benchmark for developing detectors capable of meeting industrial broadcast requirements.
Primary: Music Technology Group
All Institutions: BMAT Licensing S.L, Music Technology Group, XYZ agency
This paper presents a comprehensive approach to addressing the challenges of AI-generated music detection in broadcast environments, filling a critical gap in the existing literature and providing a valuable resource for future research. The introduction of the AI-OpenBMAT dataset and the systematic evaluation of current models under realistic conditions mark a significant contribution to the field of audio machine learning.
The paper introduces a novel dataset, AI-OpenBMAT, specifically designed for detecting AI-generated music in broadcast settings, which is a significant advancement over existing datasets that focus on streaming contexts. The methodology is robust, involving careful construction of audio excerpts that mimic real broadcast conditions, including variations in loudness and SNR. The benchmarking of existing models (CNN and SpectTTTra) under these conditions provides a clear framework for evaluating performance degradation, which is a critical aspect of the research.
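As a simple illustration of the loudness relations the benchmark controls for, the snippet below mixes a music excerpt under speech at a target SNR; it is a generic recipe, not the dataset's exact construction pipeline.

```python
# Generic illustration of SNR-controlled mixing (not the dataset's exact pipeline):
# scale a music excerpt so that it sits at a target SNR below the dominant speech.
import numpy as np

def mix_at_snr(speech: np.ndarray, music: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix music under speech at the requested speech-to-music SNR (in dB)."""
    p_speech = np.mean(speech ** 2)
    p_music = np.mean(music ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
    return speech + gain * music

# Toy usage: white-noise stand-ins mixed at 10 dB (music 10 dB below speech).
rng = np.random.default_rng(0)
mix = mix_at_snr(rng.normal(size=48000), rng.normal(size=48000), snr_db=10)
print(mix.shape)
```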
The experiments are well-structured, focusing on SNR robustness, duration robustness, and full broadcast scenarios. The results clearly demonstrate the limitations of current detection models when faced with real-world challenges, such as speech masking and short music excerpts. The use of F1-scores as a performance metric is appropriate given the class imbalance in broadcast audio, and the detailed analysis of results provides valuable insights into the performance of different models.
The paper provides a clear description of the dataset creation process and the experimental setup, which enhances reproducibility. However, specific implementation details of the models used (e.g., hyperparameters, training procedures) are somewhat lacking, which could hinder full reproducibility of the results by other researchers.
One limitation is the reliance on a single dataset for evaluation, which may not capture the full diversity of broadcast audio scenarios. Additionally, while the paper highlights the performance drop in existing models, it does not propose specific improvements or new model architectures that could address these challenges.
The research has significant implications for the music industry, particularly for broadcasters and rights holders who need reliable detection methods for AI-generated music. The dataset and findings could stimulate further research in the area of audio detection and contribute to the development of more robust detection systems.