Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park, Adobe Research, OpenAI
The main contribution of this paper is the development of TAC, a model that produces temporally grounded audio captions with low hallucination rates, significantly advancing the state of audio understanding. This work addresses critical shortcomings in existing models and presents a robust framework for future research in audio and audio-visual reasoning.
The paper introduces the Timestamped Audio Captioner (TAC) and its extension TAC-V, which leverage a synthetic data pipeline to create temporally grounded audio descriptions. The methodology is innovative, utilizing a dynamic acoustic mixer to generate complex audio mixtures with precise temporal annotations, addressing the limitations of traditional audio captioning methods that often rely on sparse annotations. The approach of separating the audio captioning task from reasoning tasks through a cascade with a text-only LLM is particularly noteworthy, allowing for independent scaling and improved performance.
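The mixer idea can be sketched in a few lines. The function below is purely illustrative (`mix_events` is a hypothetical name, and the single-channel additive overlay without rescaling is an assumption, not the paper's recipe): it overlays labelled event clips onto a background at random offsets and records the resulting timestamps as temporal ground truth.

```python
import random

def mix_events(background, events, sr=16000, max_events=3):
    """Overlay labelled event clips onto a background track at random
    offsets, recording (onset_sec, offset_sec, label) tuples as the
    temporal ground truth. Inputs are single-channel lists of floats."""
    mix = list(background)
    annotations = []
    chosen = random.sample(sorted(events.items()), k=min(max_events, len(events)))
    for label, clip in chosen:
        onset = random.randrange(0, max(1, len(mix) - len(clip)))
        for i, s in enumerate(clip):  # additive overlay (no rescaling)
            mix[onset + i] += s
        annotations.append((onset / sr, (onset + len(clip)) / sr, label))
    return mix, sorted(annotations)
```

A pipeline like this yields dense, exact annotations "for free", which is what makes the supervision temporally precise compared to sparsely labelled real recordings.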
The experiments are comprehensive, comparing TAC against state-of-the-art models on multiple benchmarks, including MMAU-Pro, MMSU, and others. The results demonstrate significant improvements in temporal grounding and reduced hallucination rates, validating the effectiveness of the proposed methods. The ablation studies provide insights into the importance of various components of the model, further strengthening the findings.
The paper provides sufficient detail regarding the implementation, including the use of specific architectures (Qwen2-Audio) and training procedures (LoRA). However, the reliance on synthetic data may introduce challenges in replicating results in real-world scenarios, which could limit reproducibility.
The authors acknowledge limitations related to the synthetic data approach, including potential biases and a sim-to-real gap. Additionally, the model may struggle with fine-grained musical precision, which could affect its applicability in certain contexts.
The work has significant implications for improving the reliability of audio understanding systems, particularly in safety-critical applications and accessibility tools for the hearing impaired. However, the potential for misuse in surveillance contexts raises ethical considerations that must be addressed.
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to global variations of the input signal level: such variations strongly influence the embedding vectors at the output of the encoder and their quantization. This joint encoding is inherently inefficient, leading to codebook redundancy and suboptimal bitrate-distortion performance. To address these limitations, we propose to introduce shape-gain decomposition, widely used in classical speech/audio coding, into the NAC framework. The principle of the proposed Equalizer methodology is to decompose the input signal -- before the NAC encoder -- into a gain and a normalized shape vector on a short-term basis. The shape vector is processed by the NAC, while the gain is quantized with scalar quantization and transmitted separately. The output (decoded) signal is reconstructed from the normalized output of the NAC and the quantized gain. Our experiments on speech signals show that this general methodology, easily applicable to any NAC, yields a substantial gain in bitrate-distortion performance, as well as a massive reduction in complexity.
Primary: Inria at Univ. Grenoble Alpes
All Institutions: Inria at Univ. Grenoble Alpes, CNRS, LJK, Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab
The main contribution of this paper is the introduction of The Equalizer, a novel methodology that applies shape-gain decomposition to enhance the performance of neural audio codecs. This work bridges classical signal processing techniques with modern machine learning approaches, providing a significant advancement in the efficiency and robustness of audio coding systems.
The proposed methodology, The Equalizer, introduces a novel shape-gain decomposition approach to neural audio codecs (NACs), which is a significant departure from traditional methods that encode gain and shape jointly. The paper effectively integrates classical signal processing concepts into modern NAC frameworks, demonstrating a clear understanding of both domains. The methodology is well-structured, involving the decomposition of input signals into gain and shape vectors before encoding, and the subsequent reconstruction of the output signal. This approach not only enhances bitrate-distortion performance but also reduces complexity, making it a valuable contribution to the field.
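The classical shape-gain idea the paper builds on can be sketched per frame as follows (function names are illustrative; the paper's exact framing, windowing, and gain quantizer are not reproduced here):

```python
import math

def shape_gain_decompose(frame, eps=1e-8):
    """Split one short-term frame into a scalar gain (its RMS energy)
    and a unit-RMS shape vector; eps guards against silent frames."""
    gain = math.sqrt(sum(s * s for s in frame) / len(frame)) + eps
    shape = [s / gain for s in frame]
    return gain, shape

def shape_gain_reconstruct(gain, shape):
    """Recombine a (possibly quantized) gain with the decoded shape."""
    return [gain * s for s in shape]
```

Because the NAC only ever sees unit-RMS shape vectors, its latent space no longer has to spend capacity representing absolute level, which is the source of the claimed robustness and bitrate savings.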
The experiments are robust, utilizing a substantial dataset (LibriSpeech) and comparing the proposed method against several state-of-the-art NACs. The evaluation metrics—STOI, PESQ, and SI-SDR—are appropriate for assessing audio quality and intelligibility. The results clearly demonstrate the advantages of the proposed method over traditional NACs, particularly in terms of robustness to gain variations and overall performance across different bitrates. The paper provides comprehensive experimental results that substantiate the claims made about the effectiveness of The Equalizer.
The paper includes detailed implementation details, including the training setup, evaluation metrics, and specific configurations used for the NACs. However, the lack of a publicly available project URL or demo limits the reproducibility of the results. Future work could benefit from making the code and models available to the community to facilitate further exploration and validation of the proposed methodology.
One limitation of the study is the focus on speech signals, which may not generalize to other audio types. Additionally, while the paper discusses the potential for future work, it does not explore the implications of the normalization on the embedding vectors in detail, which could be crucial for understanding the full impact of the proposed method.
The proposed methodology has significant implications for audio coding and compression, particularly in applications where efficient transmission and storage of audio data are critical, such as in telecommunications and streaming services. By improving the robustness and efficiency of NACs, this work could lead to better audio quality in various consumer and professional audio applications.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and a text-to-speech system in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can take future context into account for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
Primary: LY Corporation
All Institutions: LY Corporation
The paper presents CC-G2PnP, a novel streaming model for grapheme-to-phoneme and prosody conversion that addresses the challenges of unsegmented languages. Its innovative methodology and robust experimental results position it as a significant contribution to the field of audio processing and speech synthesis.
The proposed CC-G2PnP model employs a Conformer-CTC architecture that innovatively processes grapheme tokens in chunks, allowing for streaming inference of phonemic and prosodic labels. The introduction of minimum look-ahead (MLA) is a significant methodological advancement, as it addresses the limitations of previous streaming models that rely on explicit word boundaries. This approach is particularly beneficial for unsegmented languages like Japanese, where word boundaries are not clearly defined. The integration of self-conditioned CTC into the architecture further enhances the model's performance by allowing dynamic learning of alignments between graphemes and phonemes.
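Chunked processing with a guaranteed look-ahead can be sketched as follows. This is a simplified scheduling view (function name and list interface are illustrative, and the per-token minimum look-ahead of the paper is approximated here by a per-chunk right context consumed by the encoder):

```python
def stream_chunks(tokens, chunk_size=4, look_ahead=2):
    """Emit (chunk, right_context) pairs: each chunk waits until
    `look_ahead` future tokens have arrived (or the stream ends), so
    every token sees at least some right context during inference."""
    pairs = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        ctx = tokens[start + chunk_size:start + chunk_size + look_ahead]
        pairs.append((chunk, ctx))
    return pairs
```

The latency cost is bounded by the look-ahead size, which is the trade-off the model tunes between streaming delay and prediction stability.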
The experiments conducted on a Japanese dataset demonstrate the effectiveness of CC-G2PnP, showing significant improvements in character error rate (CER) and sentence error rate (SER) compared to baseline models. The use of both objective metrics and subjective assessments of TTS naturalness provides a comprehensive evaluation of the model's performance. The dataset preparation and experimental conditions are well-documented, allowing for a clear understanding of the model's capabilities and limitations.
While the paper provides detailed descriptions of the model architecture and training procedures, the lack of a publicly available code repository or demo URL limits reproducibility. The absence of specific hyperparameters and training configurations in a readily accessible format could hinder other researchers from replicating the results.
One limitation noted is the reliance on a large amount of training data to achieve optimal performance, which may not be feasible for all applications. Additionally, while the model performs well in terms of accuracy, the subjective evaluation of TTS naturalness could vary based on the speaker used during testing, which may not generalize across different voices.
The CC-G2PnP model has the potential to significantly enhance text-to-speech systems, particularly for languages without explicit word boundaries. This could lead to more natural and efficient human-machine interactions in various applications, including virtual assistants, language learning tools, and accessibility technologies for the visually impaired. The advancements in streaming G2PnP could also inspire further research in related areas, such as real-time speech synthesis and multilingual processing.
In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/meanvoiceflow/.
Primary: NTT Corporation
All Institutions: NTT Corporation
The paper presents MeanVoiceFlow, a novel one-step nonparallel voice conversion model that significantly enhances conversion speed and efficiency. The technical contributions, particularly in addressing training stability and maintaining consistency between training and inference, are well-founded and have the potential to influence future work in voice conversion and related audio applications.
The proposed MeanVoiceFlow model introduces a novel approach to voice conversion by utilizing mean flows instead of traditional instantaneous velocities, which significantly enhances the speed and efficiency of the conversion process. The introduction of a structural margin reconstruction loss addresses training instability, while the conditional diffused-input training method effectively bridges the gap between training and inference, ensuring consistency in performance. The methodology is well-structured, with clear theoretical foundations and practical implementations that are rigorously justified.
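The advantage of average over instantaneous velocity can be illustrated with a scalar toy example (an illustration of the mean-flow principle only, not the paper's model): one step with the average velocity over $[0,1]$ recovers the exact displacement $\int_0^1 v(\tau)\,d\tau$, while one Euler step with an instantaneous velocity generally does not.

```python
def displacement_instantaneous(v, t=1.0):
    """One Euler step using the instantaneous velocity at time t."""
    return t * v(t)

def displacement_mean(v, t=1.0, n=10000):
    """One step using the average velocity over [0, t], approximated
    here by midpoint-rule integration; mean flows learn this average
    velocity directly, so a single network call replaces the integral."""
    dt = t / n
    avg = sum(v((i + 0.5) * dt) for i in range(n)) * dt / t
    return t * avg
```

This is why a well-trained mean-flow model can traverse the whole inference path in one step, where flow matching with instantaneous velocity needs many small steps.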
The experimental validation is thorough, employing a variety of datasets and metrics to assess the model's performance. The results demonstrate that MeanVoiceFlow achieves performance on par with existing multi-step and distillation-based models, showcasing its effectiveness even when trained from scratch. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings, although further details on the statistical significance of the results would enhance the robustness of the claims.
The paper provides sufficient implementation details, including the architecture of the neural networks and the training procedures, which should facilitate reproducibility. However, the absence of code availability or a public repository could hinder independent verification of the results. Including a clear description of the experimental setup and hyperparameters is beneficial, yet a shared codebase would greatly enhance reproducibility.
One limitation of the study is the reliance on specific datasets, which may affect the generalizability of the results to other voice conversion tasks or languages. Additionally, while the model performs well in zero-shot scenarios, its performance in more complex voice conversion tasks involving diverse accents or languages remains to be evaluated. The potential for over-smoothing in outputs due to the structural margin reconstruction loss also warrants further investigation.
The advancements presented in this paper have significant implications for real-time voice conversion applications, such as in virtual assistants, gaming, and entertainment. The ability to convert voices quickly and effectively without extensive pretraining could democratize access to high-quality voice synthesis technologies. Furthermore, the methodologies introduced may inspire future research in related fields, such as speech synthesis and audio processing.
Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.
Primary: Nanjing University
All Institutions: Nanjing University
The main contribution of this paper is the introduction of a unified framework for flow and diffusion bridge models in speech enhancement, which enhances performance through innovative methodologies and insights. This work significantly advances the field by bridging generative and predictive modeling approaches, offering a comprehensive solution to challenges in speech enhancement.
The paper presents a unified framework that integrates flow matching and diffusion bridge models for speech enhancement, providing a novel interpretation of these models as Gaussian probability paths. The methodology is robust, combining theoretical insights with practical improvements in network architecture and training strategies. The introduction of a time embedding mechanism and an enhanced loss function demonstrates a thoughtful approach to optimizing performance while reducing complexity.
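The unifying view can be written generically as a Gaussian path between clean speech $x_0$ and its paired noisy counterpart $x_1$ (the schedules $\alpha_t$, $\beta_t$, $\sigma_t$ below are placeholders for the paper's concrete choices, not taken from it):

```latex
p_t(x_t \mid x_0, x_1) = \mathcal{N}\!\left(x_t;\, \alpha_t x_0 + \beta_t x_1,\, \sigma_t^2 I\right),
\qquad (\alpha_0, \beta_0) = (1, 0), \quad (\alpha_1, \beta_1) = (0, 1),
```

so that flow-matching, score-matching, and Schrödinger-bridge variants differ only in their mean schedules $(\alpha_t, \beta_t)$ and variance schedule $\sigma_t$.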
The experiments are well-structured, utilizing two datasets for denoising and dereverberation tasks. The results show a clear performance advantage over existing baselines, with comprehensive metrics that validate the effectiveness of the proposed model. The ablation studies further strengthen the findings by isolating the impact of various modifications.
The paper includes sufficient implementation details, including dataset descriptions, training configurations, and hyperparameter settings, which enhance reproducibility. The availability of code on GitHub supports this aspect, allowing other researchers to replicate the experiments.
While the proposed model shows significant improvements, the authors acknowledge that its inherently predictive nature may impose an upper limit on performance compared to purely predictive models. Additionally, the reliance on specific architectures may limit generalizability to other tasks or domains.
The research has potential applications in various speech processing tasks, including real-time communication systems, hearing aids, and assistive technologies for the hearing impaired. The integration of predictive paradigms into generative models could inspire further innovations in speech enhancement and related fields.
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as audio/text embeddings. For model training, we devise a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV's training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLM as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AuroLA, a novel framework that effectively utilizes Multimodal Large Language Models for audio-text retrieval, demonstrating significant improvements over existing methods. The comprehensive analysis of the technical contributions, innovative methodology, and promising experimental results highlight its potential impact on the field of machine learning and audio processing.
The proposed AuroLA framework introduces a novel approach to audio-text retrieval by leveraging Multimodal Large Language Models (MLLMs) as a unified backbone. The methodology is well-structured, with a focus on creating a scalable data pipeline and a Hybrid-NCE loss that enhances the alignment of audio and text embeddings through multi-granular supervision. The adaptation of MLLMs for retrieval tasks is innovative, particularly the use of a special token's hidden state for embeddings. However, the paper could benefit from a more detailed explanation of the implementation of the Hybrid-NCE loss and its advantages over traditional contrastive losses.
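As context for the Hybrid-NCE discussion, a plain symmetric InfoNCE over an audio-text similarity matrix looks as follows; the paper's loss additionally applies multi-granular supervision and hard-negative reweighting, which this sketch deliberately omits:

```python
import math

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a similarity matrix sim[i][j] (audio i
    vs. text j), with matched pairs on the diagonal. Returns the mean
    of the audio-to-text and text-to-audio cross-entropy losses."""
    n = len(sim)
    def row_loss(m):
        total = 0.0
        for i in range(n):
            logits = [m[i][j] / temperature for j in range(n)]
            mx = max(logits)  # log-sum-exp with max-shift for stability
            lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += lse - logits[i]
        return total / n
    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (row_loss(sim) + row_loss(transposed))
```

Hybrid-NCE can be read as this baseline with extra positive sets per granularity level and larger weights on in-batch negatives that score close to the positive.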
The experiments conducted are extensive, demonstrating the superiority of AuroLA over existing state-of-the-art models, including PE-AV, while using significantly less training data. The results are compelling, showcasing clear scaling trends that validate the proposed framework. However, the paper lacks a thorough comparison with a broader range of models and datasets, which could provide a more comprehensive understanding of AuroLA's performance across different scenarios.
The paper mentions that code is available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient implementation details or hyperparameter settings that would allow other researchers to easily replicate the experiments. A more detailed supplementary material or appendix could enhance reproducibility.
One limitation is the reliance on the quality and diversity of the audio data curated for training, which may affect the generalizability of the model. Additionally, while the use of MLLMs is innovative, the computational cost associated with training and deploying such models could be a barrier to practical applications. The paper also does not address potential biases in the data or the model's performance across different languages or dialects.
The implications of this research are significant, particularly in applications such as multimedia search engines, accessibility tools for the hearing impaired, and content-based audio retrieval systems. By improving audio-text retrieval capabilities, this work could enhance user experiences in various domains, including education, entertainment, and information retrieval.
Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying on distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Adobe Research, OpenAI
This paper introduces AudioChat, a pioneering framework for multi-source audio storytelling, editing, and understanding, which utilizes innovative methodologies to advance the field of audio processing in machine learning. The comprehensive evaluation of its technical contributions, methodology, and implications for future research underscores its significance in the domain.
The paper presents a novel framework, AudioChat, which integrates audio generation, editing, and understanding through a unified model. The methodology leverages a tool-calling agent, AudioCopilot, to synthesize training data through simulated user interactions, which is innovative in addressing the data scarcity issue in complex audio scene processing. The introduction of the Audio Transfusion Forcing objective is a significant advancement, allowing the model to perform structured reasoning and multi-turn interactions effectively. The architecture employs a continuous audio tokenizer and a multi-modal language model, which are well-justified and contribute to the model's performance.
The experiments are comprehensive, evaluating AudioChat against various baselines across multiple tasks including storytelling, editing, and understanding. The use of novel evaluation metrics like multiFLAM and editFLAM provides a more nuanced assessment of the model's capabilities compared to traditional metrics. The results indicate that AudioChat outperforms existing models, demonstrating its effectiveness in handling complex audio tasks. However, the paper could benefit from more detailed comparisons with a broader range of existing methods.
The authors provide ample details regarding the training data, hyperparameters, and methodology, which supports reproducibility. However, the proprietary nature of some training data may limit full replication of the results. The paper does a commendable job of outlining the architecture and training process, allowing for potential implementation by other researchers.
One limitation is the reliance on synthetic data generated by AudioCopilot, which may not capture the full diversity of real-world audio scenarios. Additionally, while the model shows promise, its performance in edge cases or highly nuanced audio tasks remains to be thoroughly evaluated. The potential ethical implications of audio generation technologies, such as misuse for impersonation, are acknowledged but not deeply explored.
The development of AudioChat has significant implications for various applications in multimedia, including film, gaming, and virtual reality, where immersive audio storytelling is crucial. The ability to generate and edit complex audio scenes could enhance user experiences in these domains. However, the potential for misuse in creating deceptive audio content raises ethical concerns that need to be addressed by the research community.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model to connect large language model and text-to-speech in a streaming manner. CC-G2PnP is based on Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing minimal look-ahead size to each input token, the proposed model can consider future context in each token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
Primary: LY Corporation
All Institutions: LY Corporation
The paper presents CC-G2PnP, a novel streaming model for grapheme-to-phoneme and prosody conversion that addresses the challenges of unsegmented languages. Its innovative methodology and robust experimental results position it as a significant contribution to the field of audio processing and speech synthesis.
The proposed CC-G2PnP model employs a Conformer-CTC architecture that innovatively processes grapheme tokens in chunks, allowing for streaming inference of phonemic and prosodic labels. The introduction of minimum look-ahead (MLA) is a significant methodological advancement, as it addresses the limitations of previous streaming models that rely on explicit word boundaries. This approach is particularly beneficial for unsegmented languages like Japanese, where word boundaries are not clearly defined. The integration of self-conditioned CTC into the architecture further enhances the model's performance by allowing dynamic learning of alignments between graphemes and phonemes.
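The chunk-wise streaming scheme described above can be sketched as follows; the chunk and look-ahead sizes are illustrative, not the paper's values:

```python
def stream_chunks(tokens, chunk_size=8, look_ahead=4):
    """Yield (chunk, future_context) pairs so that every token in a chunk
    has at least `look_ahead` future tokens available before it is emitted
    (clipped at the end of the input)."""
    n = len(tokens)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        ctx_end = min(end + look_ahead, n)  # minimum look-ahead window
        yield tokens[start:end], tokens[end:ctx_end]

graphemes = list("こんにちは、世界。")  # unsegmented Japanese input
steps = list(stream_chunks(graphemes, chunk_size=4, look_ahead=2))
```

Each chunk is emitted only once its look-ahead window has arrived, trading a small, bounded latency for access to future context at every token.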
The experiments conducted on a Japanese dataset demonstrate the effectiveness of CC-G2PnP, showing significant improvements in character error rate (CER) and sentence error rate (SER) compared to baseline models. The use of both objective metrics and subjective assessments of TTS naturalness provides a comprehensive evaluation of the model's performance. The dataset preparation and experimental conditions are well-documented, allowing for a clear understanding of the model's capabilities and limitations.
While the paper provides detailed descriptions of the model architecture and training procedures, the lack of a publicly available code repository or demo URL limits reproducibility. The absence of specific hyperparameters and training configurations in a readily accessible format could hinder other researchers from replicating the results.
One limitation noted is the reliance on a large amount of training data to achieve optimal performance, which may not be feasible for all applications. Additionally, while the model performs well in terms of accuracy, the subjective evaluation of TTS naturalness could vary based on the speaker used during testing, which may not generalize across different voices.
The CC-G2PnP model has the potential to significantly enhance text-to-speech systems, particularly for languages without explicit word boundaries. This could lead to more natural and efficient human-machine interactions in various applications, including virtual assistants, language learning tools, and accessibility technologies for the visually impaired. The advancements in streaming G2PnP could also inspire further research in related areas, such as real-time speech synthesis and multilingual processing.
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
Primary: Stanford University
All Institutions: Stanford University, SCB 10X, OpenAthena, University of Southern California, University of Cambridge
The main contribution of this paper is the introduction of SODA, a scalable audio foundation model that effectively integrates semantic, acoustic, and text tokens, providing a comprehensive framework for advancing audio modeling. This work significantly enhances the understanding of scaling laws in audio models and sets a foundation for future innovations in the field.
The methodology presented in the paper is robust and systematic, focusing on the design choices that influence the performance of audio foundation models. The authors thoroughly investigate various aspects, including data sources, text mixture ratios, and token composition, which are critical for optimizing model performance. The introduction of the SODA model, which integrates semantic, acoustic, and text tokens, represents a significant advancement in audio modeling. The use of next-token prediction at scale is a novel approach that extends the capabilities of existing models.
The paper includes a comprehensive empirical evaluation, particularly through the IsoFLOP analysis that examines scaling laws for discrete audio models. The authors provide extensive experimentation across 64 models, which is a commendable effort to validate their findings. The results indicate that optimal data grows faster than model size, which is a valuable insight for future research in this area. However, the paper could benefit from more detailed comparisons with existing models beyond the scaling predictions.
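The headline scaling result can be made concrete with a back-of-the-envelope sketch; the power-law form and the exponent constraint below are standard IsoFLOP assumptions, not the paper's fitted coefficients:

```python
# Illustrative IsoFLOP-style allocation, NOT the paper's fitted constants.
# Assume compute C ~ 6*N*D and power laws N_opt ~ C**a, D_opt ~ C**b with
# a + b = 1. "Optimal data grows 1.6x faster than model size" means b = 1.6*a.
ratio = 1.6
a = 1 / (1 + ratio)          # exponent for optimal model size
b = ratio / (1 + ratio)      # exponent for optimal data size

scale = 10.0                 # multiply the FLOP budget by 10x
model_growth = scale ** a    # factor by which N_opt grows
data_growth = scale ** b     # factor by which D_opt grows
```

Under these assumptions, a 10x larger compute budget should be spent disproportionately on data rather than parameters.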
While the authors mention establishing a validated training recipe, the paper lacks specific implementation details that would facilitate reproducibility. Providing access to code or detailed hyperparameter settings would enhance the paper's contribution to the community and allow for independent verification of results.
One limitation is the reliance on a specific architecture for the SODA model, which may not generalize well to all audio tasks. Additionally, the paper does not address potential biases in the training data or the implications of using large-scale models in real-world applications. The scaling law findings, while insightful, may also be context-dependent and require further validation across diverse datasets.
The implications of this research are significant, as it opens up new avenues for audio generation and cross-modal tasks, such as speech-to-speech translation. The ability to model semantic content alongside acoustic details can enhance applications in various domains, including entertainment, accessibility, and communication technologies. The findings could influence future research directions and encourage the development of more sophisticated audio models.
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
Primary: Ghent University
All Institutions: Ghent University, Fraunhofer IEE, University of Kassel
The paper presents a significant advancement in audio self-supervised learning through the introduction of Convex Gated Probing and the Better Audio Transformer, addressing critical gaps in evaluation methodologies and model performance. The comprehensive experimental validation and emphasis on reproducibility enhance its contributions to the field.
The paper introduces Convex Gated Probing (CGP), a novel probing method that leverages a gating mechanism to efficiently utilize all frozen layers of audio SSL models. This approach addresses the limitations of existing probing techniques, which often fail to capture the full potential of audio embeddings. The methodology is well-structured, presenting a clear rationale for the design choices and improvements made to the SSL pipeline, leading to the development of the Better Audio Transformer (BAT). The integration of CGP into the SSL framework is innovative and shows promise in enhancing model evaluation and performance.
The experiments are comprehensive, demonstrating the effectiveness of BAT across various audio benchmarks. The authors provide detailed comparisons against state-of-the-art models, showcasing significant performance improvements in both frozen-feature probing and fine-tuning scenarios. The results are well-documented, with sufficient statistical rigor to support the claims made regarding the superiority of BAT over existing models.
The authors emphasize the importance of reproducibility and provide a new PyTorch implementation to facilitate this. However, the paper mentions challenges in replicating results from existing models, which raises questions about the reliability of previous benchmarks. The authors' efforts to standardize methodologies and hyperparameters contribute positively to the reproducibility aspect, although the lack of a public code repository limits accessibility.
One limitation noted is the reliance on the specific architecture of the Better Audio Transformer, which may not generalize across different audio tasks or datasets. Additionally, while the CGP method shows promise, its effectiveness in more complex audio scenarios or with other model architectures remains to be validated. The paper also acknowledges the challenges of hyperparameter sensitivity in fine-tuning, which could affect the generalizability of results.
The advancements presented in this work have the potential to significantly impact the field of self-supervised audio representation learning. By improving the evaluation methods and model architectures, the research could lead to more efficient and accessible audio models, reducing computational overhead and fostering innovation in audio-related applications. The focus on reproducibility and transparency also aligns with broader efforts to enhance the reliability of machine learning research.
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally, we validate our results through a subjective listener test, where subjects gave positive ratings to the proposed model generations. This technique paves the way for more controllable and expressive generative sound frameworks, enabling sound designers to focus less on tedious, repetitive tasks and more on their actual creative process.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach for generating high-quality audio extensions and morphs using Diffusion Transformers and a variant of classifier-free guidance. The technical contributions are significant, addressing real-world challenges faced by sound designers and demonstrating promising results through rigorous evaluation.
The methodology presented in this paper is robust and innovative, leveraging Diffusion Transformers and a novel Audio Prompt Guidance technique to effectively extend and morph audio. The authors provide a clear description of their approach, including the masking function and the fine-tuning strategy using the Noise Floor Dataset to mitigate hallucinations. However, while the methodology is well-structured, it could benefit from a more detailed exploration of the limitations of the masking function and guidance techniques in varying audio contexts.
The experimental evaluation is comprehensive, employing both objective metrics (Fréchet Audio Distance) and subjective listener tests to validate the effectiveness of the proposed model. The use of a large dataset for training and the careful selection of evaluation clips from sound design professionals enhances the credibility of the results. However, the paper could improve by including more diverse audio samples and comparing against a broader range of existing methods.
The paper provides sufficient detail on the architecture, training process, and evaluation metrics, which aids in reproducibility. However, the absence of specific code or model weights limits the ease with which other researchers can replicate the results. Including a GitHub repository or similar resource would significantly enhance reproducibility.
The paper acknowledges the potential for hallucinations in generated audio, particularly with stationary sounds, and discusses the trade-off between reducing hallucinations and maintaining fidelity to the original prompts. However, it does not thoroughly address how the model performs with non-stationary sounds or in complex soundscapes, which could be a significant limitation for practical applications.
The proposed model has the potential to significantly impact the field of sound design by automating tedious tasks and enhancing the creative process for sound designers. The ability to generate high-quality audio extensions and morphs could streamline workflows in various industries, including film, gaming, and virtual reality. Furthermore, the methodology could inspire future research in generative audio models and their applications in other domains.
This paper presents virtual upmixing of steering vectors captured by a spherical microphone array with fewer channels. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of the FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings conditioned on the FOA data. Experimental results show that SIRUP achieves significant improvements over FOA systems in steering vector upmixing, source localization, and speech denoising.
Primary: unknown
All Institutions: JSPS KAKENHI, JST FOREST, ANR Project SAROUMANE
The main contribution of this paper is the introduction of SIRUP, a novel diffusion-based approach for enhancing spatial audio representation from FOA to HOA, which addresses critical limitations in existing methods and demonstrates significant improvements in sound source localization and speech denoising. The methodology is innovative, and the experimental results are promising, indicating a strong potential impact on the field of audio processing and machine listening.
The proposed SIRUP method innovatively integrates a variational autoencoder (VAE) with a latent diffusion model to enhance steering vector upmixing from first-order ambisonics (FOA) to higher-order ambisonics (HOA). This approach addresses the limitations of traditional methods by directly learning a latent representation of HOA data, conditioned on FOA inputs, which is a significant departure from the conventional cascaded analysis-rendering pipeline. The use of a composite loss function that combines cosine similarity with MSE is a thoughtful addition that likely contributes to the stability and performance of the model.
The experimental setup is robust, utilizing simulated room impulse responses to evaluate the performance of SIRUP across various conditions, including different signal-to-noise ratios and reverberation times. The metrics chosen for evaluation, such as beamwidth and directivity index, are appropriate for assessing the quality of the upmixed steering vectors. The results indicate that SIRUP significantly outperforms FOA systems, demonstrating its effectiveness in sound source localization and speech denoising.
While the paper provides a detailed description of the methodology, including model architecture and training procedures, it lacks explicit links to code repositories or supplementary materials that would facilitate reproducibility. The absence of a publicly available implementation may hinder other researchers from validating the findings.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world acoustic environments. Additionally, the paper does not address the scalability of the method to larger microphone arrays or the potential computational costs associated with training the diffusion model.
The implications of this research are significant for machine listening applications, particularly in augmented reality, robotics, and autonomous systems, where accurate spatial audio representation is crucial. By improving the spatial resolution of sound source localization and enhancing speech denoising, SIRUP could lead to advancements in user experience and system performance in these domains.
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park, Adobe Research, OpenAI
The main contribution of this paper is the development of TAC, a model that produces temporally grounded audio captions with low hallucination rates, significantly advancing the state of audio understanding. This work addresses critical shortcomings in existing models and presents a robust framework for future research in audio and audio-visual reasoning.
The paper introduces the Timestamped Audio Captioner (TAC) and its extension TAC-V, which leverage a synthetic data pipeline to create temporally grounded audio descriptions. The methodology is innovative, utilizing a dynamic acoustic mixer to generate complex audio mixtures with precise temporal annotations, addressing the limitations of traditional audio captioning methods that often rely on sparse annotations. The approach of separating the audio captioning task from reasoning tasks through a cascade with a text-only LLM is particularly noteworthy, allowing for independent scaling and improved performance.
The experiments are comprehensive, comparing TAC against state-of-the-art models on multiple benchmarks, including MMAU-Pro, MMSU, and others. The results demonstrate significant improvements in temporal grounding and reduced hallucination rates, validating the effectiveness of the proposed methods. The ablation studies provide insights into the importance of various components of the model, further strengthening the findings.
The paper provides sufficient detail regarding the implementation, including the use of specific architectures (Qwen2-Audio) and training procedures (LoRA). However, the reliance on synthetic data may introduce challenges in replicating results in real-world scenarios, which could limit reproducibility.
The authors acknowledge limitations related to the synthetic data approach, including potential biases and a sim-to-real gap. Additionally, the model may struggle with fine-grained musical precision, which could affect its applicability in certain contexts.
The work has significant implications for improving the reliability of audio understanding systems, particularly in safety-critical applications and accessibility tools for the hearing impaired. However, the potential for misuse in surveillance contexts raises ethical considerations that must be addressed.
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to global variations of the input signal level: such variations strongly influence the embedding vectors at the encoder output and their quantization. This methodology is inherently inefficient, leading to codebook redundancy and suboptimal bitrate-distortion performance. To address these limitations, we propose to introduce shape-gain decomposition, widely used in classical speech/audio coding, into the NAC framework. The principle of the proposed Equalizer methodology is to decompose the input signal -- before the NAC encoder -- into a gain and a normalized shape vector on a short-term basis. The shape vector is processed by the NAC, while the gain is quantized with scalar quantization and transmitted separately. The output (decoded) signal is reconstructed from the normalized output of the NAC and the quantized gain. Our experiments conducted on speech signals show that this general methodology, easily applicable to any NAC, enables a substantial gain in bitrate-distortion performance, as well as a massive reduction in complexity.
Primary: Inria at Univ. Grenoble Alpes
All Institutions: Inria at Univ. Grenoble Alpes, CNRS, LJK, Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab
The main contribution of this paper is the introduction of The Equalizer, a novel methodology that applies shape-gain decomposition to enhance the performance of neural audio codecs. This work bridges classical signal processing techniques with modern machine learning approaches, providing a significant advancement in the efficiency and robustness of audio coding systems.
The proposed methodology, The Equalizer, introduces a novel shape-gain decomposition approach to neural audio codecs (NACs), which is a significant departure from traditional methods that encode gain and shape jointly. The paper effectively integrates classical signal processing concepts into modern NAC frameworks, demonstrating a clear understanding of both domains. The methodology is well-structured, involving the decomposition of input signals into gain and shape vectors before encoding, and the subsequent reconstruction of the output signal. This approach not only enhances bitrate-distortion performance but also reduces complexity, making it a valuable contribution to the field.
The experiments are robust, utilizing a substantial dataset (LibriSpeech) and comparing the proposed method against several state-of-the-art NACs. The evaluation metrics—STOI, PESQ, and SI-SDR—are appropriate for assessing audio quality and intelligibility. The results clearly demonstrate the advantages of the proposed method over traditional NACs, particularly in terms of robustness to gain variations and overall performance across different bitrates. The paper provides comprehensive experimental results that substantiate the claims made about the effectiveness of The Equalizer.
The paper includes detailed implementation details, including the training setup, evaluation metrics, and specific configurations used for the NACs. However, the lack of a publicly available project URL or demo limits the reproducibility of the results. Future work could benefit from making the code and models available to the community to facilitate further exploration and validation of the proposed methodology.
One limitation of the study is the focus on speech signals, which may not generalize to other audio types. Additionally, while the paper discusses the potential for future work, it does not explore the implications of the normalization on the embedding vectors in detail, which could be crucial for understanding the full impact of the proposed method.
The proposed methodology has significant implications for audio coding and compression, particularly in applications where efficient transmission and storage of audio data are critical, such as in telecommunications and streaming services. By improving the robustness and efficiency of NACs, this work could lead to better audio quality in various consumer and professional audio applications.
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
Primary: unknown
All Institutions: unknown
The paper presents a novel generative-first neural audio autoencoder that significantly improves encoding speed and compression efficiency while maintaining high reconstruction quality. This work is a meaningful contribution to the field of audio processing, addressing key limitations of existing models and opening avenues for practical applications in generative audio tasks.
The paper introduces a generative-first architecture for audio autoencoding, which is a significant departure from the traditional reconstruction-first approach. The methodology is well-structured, with clear architectural modifications aimed at improving efficiency and flexibility. The use of efficient activations, early downsampling, and the incorporation of mel-spectrograms to capture high-frequency information are notable innovations. The post-training adaptation to support both continuous and discrete latents without retraining is particularly clever and enhances the model's applicability.
The experimental setup is robust, with thorough evaluations of speed, quality, and generative utility. The benchmarks against state-of-the-art codecs demonstrate the effectiveness of GenAE in achieving better compression and reconstruction quality. The use of multiple metrics (SI-SDR, STFT loss, mel-spectrogram L1 distance) adds credibility to the results. However, the absence of a clear comparison with a wider range of existing models could limit the perceived impact.
The paper provides detailed implementation specifics, including architecture choices, training configurations, and evaluation metrics, which are essential for reproducibility. However, the lack of accessible code or a demo limits the practical reproducibility of the results.
The paper does not address potential limitations in terms of the generalizability of the model across different audio types beyond instrumental music. Additionally, the computational resources required for training (8 A100 GPUs for a week) may not be accessible to all researchers, which could hinder broader adoption.
The advancements in audio autoencoding presented in this paper have the potential to significantly impact various applications, including music generation, audio compression, and real-time audio processing. The ability to handle multiple audio formats with a single model streamlines workflows and could lead to more efficient use of computational resources in audio-related tasks.
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
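The abstract's token count can be sanity-checked with simple arithmetic; the 44.1 kHz sample rate below is an assumption, since the abstract does not state it:

```python
import math

SAMPLE_RATE = 44_100   # assumed 44.1 kHz; the abstract does not state the rate
DURATION_S = 60
NEW_FACTOR = 3360      # temporal downsampling reported for the generative-first model
OLD_FACTOR = 2048      # prior reconstruction-first downsampling

samples = SAMPLE_RATE * DURATION_S                 # 2,646,000 samples
new_tokens = math.ceil(samples / NEW_FACTOR)       # 788, matching the abstract's figure
old_tokens = math.ceil(samples / OLD_FACTOR)       # 1292 tokens at the old factor
print(new_tokens, old_tokens, round(old_tokens / new_tokens, 2))  # 788 1292 1.64
```

The 1292/788 ratio of roughly 1.64 is consistent with the claimed "1.6x lower rates."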
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a generative-first neural audio autoencoder that optimizes encoding speed and compression while maintaining high reconstruction quality across various audio formats. This work represents a significant advancement in the field of audio processing, addressing key limitations of existing models and paving the way for more efficient generative audio applications.
The proposed methodology introduces a generative-first architecture that significantly optimizes the encoding process for audio autoencoders. By focusing on architectural modifications such as efficient activations, early downsampling, and the integration of mel-spectrograms, the authors effectively address the limitations of existing reconstruction-first models. The approach to unify continuous and discrete latent representations through a post-training process is particularly innovative, allowing for greater flexibility in generative modeling. However, the paper could benefit from a clearer explanation of the theoretical underpinnings of some architectural choices, particularly the use of SnakeLite activations and their impact on performance.
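The SnakeLite variant itself is not defined in this summary; for orientation, a minimal sketch of the standard Snake activation it presumably lightens, snake(x) = x + sin²(αx)/α, whose periodic inductive bias is well suited to audio:

```python
import math

def snake(x, alpha=1.0):
    """Standard Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha

print(snake(0.0))                 # 0.0 -> identity at the origin
print(round(snake(math.pi), 6))   # 3.141593 -> identity wherever sin(alpha*x) = 0
```

Between those zero crossings the sin² term adds a non-negative periodic ripple on top of the identity, which is the property that makes Snake-family activations popular in neural audio codecs.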
The experiments are well-structured, comparing the proposed GenAE model against several state-of-the-art codecs. The use of multiple metrics (e.g., SI-SDR, PESQ-WB) to evaluate reconstruction quality and the real-time factor for speed assessment provides a comprehensive view of the model's performance. However, the paper lacks detailed descriptions of the datasets used for training and evaluation, which may affect the reproducibility and generalizability of the results. Additionally, the absence of a direct comparison with other generative models limits the contextual understanding of GenAE's advantages.
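The real-time factor used for the speed assessment is conventionally defined as wall-clock processing time divided by audio duration, with RTF < 1 meaning faster than real time. A minimal sketch with a stand-in workload (the 10 ms timing is illustrative, not from the paper):

```python
import time

def real_time_factor(process, audio_duration_s):
    """RTF = wall-clock time to process a clip / clip duration; < 1 is faster than real time."""
    start = time.perf_counter()
    process()
    return (time.perf_counter() - start) / audio_duration_s

# Toy stand-in for an encoder: pretend processing a 60 s clip takes ~10 ms
rtf = real_time_factor(lambda: time.sleep(0.01), audio_duration_s=60.0)
print(rtf < 1.0)  # True: comfortably faster than real time
```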
The paper provides sufficient details on the architecture and training setup, including hyperparameters and loss functions, which aids in reproducibility. However, the lack of publicly available code or datasets limits the ability for other researchers to replicate the results fully. The authors should consider releasing their model and training data to enhance reproducibility.
One limitation is the reliance on specific audio datasets, which may not fully represent the diversity of audio signals encountered in real-world applications. Additionally, while the model achieves impressive speed and compression rates, the trade-off between these factors and reconstruction quality in extreme cases is not thoroughly explored. The potential for overfitting due to the complexity of the model, especially with the extensive use of attention mechanisms, is another concern.
The advancements presented in this paper could significantly impact various applications in audio processing, including music generation, audio compression for streaming, and real-time audio manipulation. By enabling faster and more efficient audio encoding, the GenAE model could facilitate broader adoption of generative audio technologies in both commercial and research settings. The ability to handle multiple audio formats in a single model also simplifies deployment for developers.
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
Primary: Carleton University
All Institutions: Carleton University, Zendesk, Durham University, Salute Devices, MIRAI, Stanford University, Aarhus University, Indian Institute of Technology, Kharagpur, Harvard University, Capital One
The paper introduces the Massive Audio Embedding Benchmark (MAEB), a significant contribution to the field of audio machine learning that provides a comprehensive evaluation framework across diverse tasks and languages. The methodology and experimental results offer valuable insights into model performance, although further statistical analysis and detailed reproducibility guidelines would enhance its impact.
The methodology presented in the paper is robust, introducing a comprehensive benchmark (MAEB) that spans multiple audio tasks and languages. The authors provide a clear rationale for the selection of tasks and models, and the integration into the MTEB ecosystem is a significant step towards unified evaluation across modalities. However, the paper could benefit from a more detailed description of the benchmarking process and the specific metrics used for evaluation.
The experiments are extensive, evaluating over 50 models across 30 tasks. The results highlight the performance discrepancies between models trained for different audio tasks, which is a valuable insight for future research. However, the paper lacks a thorough statistical analysis of the results, which would strengthen the claims made regarding model performance.
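The call for statistical analysis can be made concrete: a paired bootstrap over tasks is one standard, lightweight option for testing whether one model reliably outperforms another across a benchmark. The per-task scores below are hypothetical, purely for illustration:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Fraction of bootstrap resamples (over tasks) in which model A's mean beats model B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample tasks with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_boot

# Hypothetical per-task accuracies for two models on a handful of tasks
a = [0.81, 0.77, 0.92, 0.64, 0.70, 0.85]
b = [0.79, 0.75, 0.88, 0.66, 0.69, 0.80]
p = paired_bootstrap(a, b)
print(p > 0.8)  # True: A's advantage holds across most task resamples
```

A win fraction near 1.0 (or near 0.0) indicates a difference that is robust to which tasks happened to be included, which is exactly the kind of evidence the review finds missing.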
The authors have committed to releasing code and a leaderboard, which is commendable and supports reproducibility. However, the paper should include more detailed instructions on how to replicate the experiments, including specific configurations and hyperparameters used for each model.
A noted limitation is weak performance on clustering tasks, where even the best-performing model achieves only modest results. Additionally, the paper acknowledges a trade-off between acoustic and linguistic understanding, which may limit the applicability of any single model across all tasks.
The MAEB benchmark has the potential to significantly impact the field of audio machine learning by providing a standardized evaluation framework. This could lead to improved model development and encourage further research into multilingual and cross-modal audio tasks. The release of the benchmark also promotes collaboration and transparency in the research community.