Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The paper presents FoleyBench, a novel benchmark for evaluating video-to-audio generation models specifically for Foley sound effects. It addresses critical gaps in existing datasets and provides valuable insights into model performance, paving the way for future advancements in the field.
The methodology employed in constructing FoleyBench is robust, utilizing an automated multi-stage pipeline that effectively filters and curates video clips to ensure high-quality audio-visual correspondence. The approach includes scene detection, audio filtering using YAMNet, and audiovisual filtering with Gemini 2.5 Pro, which collectively ensure that the dataset is tailored for Foley sound effects. The systematic categorization of clips with metadata enhances the dataset's usability for fine-grained analysis of model performance. However, the reliance on automated processes may introduce biases if the filtering models are not perfectly calibrated.
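To make the audio-side filtering stage concrete, here is a minimal sketch, assuming 16 kHz mono input and an illustrative threshold, of how a YAMNet-based filter could flag clips dominated by speech or music. This is not the authors' pipeline, which additionally uses scene detection and Gemini 2.5 Pro audiovisual filtering; it only shows the kind of check such a stage performs.

```python
# Minimal sketch (not the paper's code): reject clips whose audio is mostly
# speech or music, using the public YAMNet model from TensorFlow Hub.
import csv
import numpy as np
import soundfile as sf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet ships a CSV mapping class indices to AudioSet display names.
class_names = [row["display_name"]
               for row in csv.DictReader(open(yamnet.class_map_path().numpy()))]

def speech_music_fraction(wav_path: str) -> float:
    """Fraction of frames whose top YAMNet prediction is Speech or Music."""
    audio, sr = sf.read(wav_path, dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)              # downmix to mono
    assert sr == 16000, "YAMNet expects 16 kHz audio"
    scores, _, _ = yamnet(audio)                # scores: [num_frames, 521]
    top = np.argmax(scores.numpy(), axis=1)
    bad = {i for i, n in enumerate(class_names) if n in ("Speech", "Music")}
    return float(np.mean([t in bad for t in top]))

# Keep a clip only if speech/music rarely dominates (0.2 is an assumed threshold).
keep = speech_music_fraction("clip.wav") < 0.2
```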
The experiments conducted on various state-of-the-art V2A models provide a comprehensive evaluation of their performance on the newly established benchmark. The paper presents a clear comparison between FoleyBench and existing datasets like VGGSound, highlighting the inadequacies of the latter for Foley-style evaluations. The inclusion of multiple metrics (audio quality, audio-video alignment, temporal synchronization, and audio-text consistency) allows for a nuanced understanding of model capabilities. The results reveal significant insights into the strengths and weaknesses of different models, particularly in relation to specific sound categories and conditions.
The paper provides sufficient details regarding the dataset construction and evaluation metrics, which aids in reproducibility. However, the exact implementation details of the models evaluated are not fully disclosed, which may hinder complete reproducibility of the results. The availability of the dataset and benchmark scores is a positive aspect, as it allows other researchers to validate findings and build upon this work.
One limitation is the potential bias introduced by the automated filtering process, which may not capture all relevant audio-visual relationships accurately. Additionally, while the dataset is extensive, the reliance on internet-sourced videos may introduce variability in audio quality and content diversity. The paper also notes that current models struggle with high-fidelity audio generation for discrete sounds and long-form videos, indicating areas for further research.
The introduction of FoleyBench has significant implications for the fields of audio generation and sound design, particularly in creative industries such as film and gaming. By providing a dedicated benchmark for Foley sound effects, it encourages the development of more sophisticated V2A models that can produce high-quality, contextually appropriate audio. This work could lead to advancements in AR/VR applications and enhance user experiences in multimedia content creation.
Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high computational costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and innovative analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while simultaneously lowering the entry barrier for aspiring musicians, an innovative and significant contribution to AI for music generation.
Primary: Stanford University
All Institutions: Stanford University, George Mason University, IntelliSky
The paper presents MusicAIR, a novel multimodal AI music generation framework that leverages algorithm-driven methods to produce music from lyrics and images, significantly contributing to the field of AI in music generation. The comprehensive methodology and promising experimental results highlight its potential impact on music composition and education.
The methodology presented in MusicAIR is innovative, focusing on a purely algorithm-driven approach to music generation that circumvents the need for large datasets, thus addressing significant copyright and resource concerns prevalent in traditional neural network-based models. The framework's ability to generate music from multiple modalities (lyrics, text, images) is a notable advancement, particularly the integration of LLMs for image-to-lyric conversion, which is a unique aspect not commonly explored in existing literature. The detailed algorithms for score setup, rhythmic construction, and pitch generation are well-articulated, showcasing a comprehensive understanding of music theory and its application in AI.
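As a toy illustration of the algorithm-driven, dataset-free idea (not the authors' actual score-setup or rhythmic-construction rules), the sketch below maps naive syllable counts in a lyric line onto note durations in a 4/4 measure. The syllable heuristic and the beat-weighting scheme are assumptions made purely for illustration.

```python
# Toy rule-based rhythm derivation from lyrics; every rule here is illustrative.
import re

def syllables(word: str) -> int:
    # Naive heuristic: count contiguous groups of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def line_rhythm(line: str, beats_per_measure: int = 4) -> list[float]:
    counts = [syllables(w) for w in line.split()]
    total = sum(counts)
    # Spread the measure's beats across the words, weighted by syllable count.
    return [beats_per_measure * c / total for c in counts]

print(line_rhythm("Twinkle twinkle little star"))
# ≈ [1.14, 1.14, 1.14, 0.57] beats per word for this line
```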
The experiments conducted using the GenAIM tool are robust, comparing AI-generated compositions against original works using objective music theory metrics such as key confidence, melodic smoothness, and rhythm matching. The results indicate that the AI system performs competitively with human composers, achieving higher key confidence scores. However, the reliance on theory-based evaluations rather than human listening tests may limit the assessment of the music's emotional and aesthetic qualities, which are critical in music generation.
The paper provides sufficient detail regarding the algorithms and system architecture, allowing for potential reproducibility. However, the absence of a publicly available code repository or detailed implementation instructions limits the ability for others to replicate the results fully. The use of AWS for deployment is mentioned, but specifics on the setup and configuration are lacking.
The paper acknowledges limitations such as the lack of contextual depth in lyrics generated from images and the absence of variations for different moods and genres in the music produced. These factors could restrict the applicability of the framework in more diverse musical contexts. Additionally, the focus on melody generation without incorporating harmonic complexity may limit the richness of the compositions.
The MusicAIR framework has significant potential applications in music education, composition assistance, and creative industries, lowering barriers for aspiring musicians and enhancing accessibility to music creation tools. Its innovative approach could inspire further research in multimodal AI applications and contribute to the ongoing discourse on copyright and ethical considerations in AI-generated content.
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT's robustness, scalability, and efficiency. Our project page is available at https://yoni-yaffe.github.io/count-the-notes.
Primary: International Audio Laboratories Erlangen
All Institutions: International Audio Laboratories Erlangen, Tel Aviv University
CountEM introduces a novel framework for automatic music transcription that utilizes histogram-based supervision, significantly simplifying the annotation process while maintaining high transcription accuracy. The technical contributions and methodology present a meaningful advancement in the field of music information retrieval, with potential applications across various domains.
The methodology presented in CountEM is innovative, leveraging note event histograms for supervision in automatic music transcription (AMT) without requiring explicit local alignment. The use of an Expectation-Maximization (EM) approach to iteratively refine predictions based on histogram counts is a significant departure from traditional methods that rely on Dynamic Time Warping (DTW) and local alignment techniques. The peak-picking mechanism for estimating note onsets is straightforward yet effective, allowing for greater flexibility and reduced computational overhead. The paper effectively communicates the theoretical underpinnings of the approach, making it accessible for further exploration and adaptation.
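The count-based supervision can be made concrete with a short sketch. The exact EM refinement is the paper's contribution and is not reproduced here; the loss below only illustrates the underlying idea of matching expected note counts to a histogram, with the MSE form and tensor shapes as assumptions.

```python
# Sketch of histogram (count) supervision for transcription, written as a
# differentiable loss in PyTorch; not CountEM's actual EM procedure.
import torch
import torch.nn.functional as F

def histogram_count_loss(onset_probs: torch.Tensor,
                         note_counts: torch.Tensor) -> torch.Tensor:
    """
    onset_probs: [T, 128] per-frame onset probabilities for each MIDI pitch.
    note_counts: [128] how often each pitch occurs in the unaligned segment.
    The expected number of onsets is the sum of probabilities over time; only
    this count is constrained, so no frame-level alignment target is needed.
    """
    expected_counts = onset_probs.sum(dim=0)          # [128]
    return F.mse_loss(expected_counts, note_counts.float())

# Toy usage with random predictions.
logits = torch.randn(1000, 128, requires_grad=True)
counts = torch.randint(0, 5, (128,))
loss = histogram_count_loss(logits.sigmoid(), counts)
loss.backward()
```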
The experiments conducted across various datasets (MAESTRO, GuitarSet, and MusicNet) demonstrate the robustness and generalizability of CountEM. The results indicate that the proposed method matches or surpasses existing weakly supervised methods, showcasing improved transcription accuracy even with large time windows. The evaluation metrics used, including note-level precision, recall, and F-score, are appropriate for the task and provide a comprehensive assessment of the model's performance. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize its contributions further.
The paper provides sufficient implementation details, including the architecture used, training procedures, and hyperparameters, which facilitate reproducibility. The use of PyTorch and the mention of specific hardware (NVIDIA GeForce RTX 3090 GPUs) enhances the clarity of the experimental setup. However, the absence of a public code repository may hinder full reproducibility for some researchers.
One limitation of the proposed method is its reliance on the quality of the histograms derived from potentially noisy or imprecise musical scores, which could affect transcription accuracy in real-world applications. Additionally, while the method shows promise across various datasets, its performance in more complex polyphonic scenarios or with less structured musical data remains to be fully explored.
The implications of CountEM extend beyond music transcription, potentially influencing related fields such as instrument recognition, rhythm analysis, and even lyrics transcription. By reducing the annotation burden and improving the scalability of AMT systems, this work could democratize access to music transcription technologies, benefiting educators, musicians, and researchers alike.
Accurate modeling of time-varying underwater acoustic channels is essential for the design, evaluation, and deployment of reliable underwater communication systems. Conventional physics models require detailed environmental knowledge, while stochastic replay methods are constrained by the limited diversity of measured channels and often fail to generalize to unseen scenarios, reducing their practical applicability. To address these challenges, we propose StableUASim, a pre-trained conditional latent diffusion surrogate model that captures the stochastic dynamics of underwater acoustic communication channels. Leveraging generative modeling, StableUASim produces diverse and statistically realistic channel realizations, while supporting conditional generation from specific measurement samples. Pre-training enables rapid adaptation to new environments using minimal additional data, and the autoencoder latent representation facilitates efficient channel analysis and compression. Experimental results demonstrate that StableUASim accurately reproduces key channel characteristics and communication performance, providing a scalable, data-efficient, and physically consistent surrogate model for both system design and machine learning-driven underwater applications.
Primary: National University of Singapore
All Institutions: National University of Singapore, ARL, Tropical Marine Science Institute
The main contribution of this paper is the development of StableUASim, a conditional latent diffusion model that effectively simulates time-varying underwater acoustic channels, addressing key limitations of existing modeling approaches. The comprehensive methodology and experimental validation underscore its potential to significantly advance the field of underwater acoustic communication modeling.
The paper introduces StableUASim, a novel conditional latent diffusion model tailored for simulating time-varying underwater acoustic channels. The methodology is robust, combining generative modeling techniques with a pre-training strategy that enhances data efficiency and adaptability to new environments. The use of an autoencoder for latent representation and the integration of diffusion processes for channel simulation are innovative, addressing the limitations of traditional physics-based and stochastic models. The approach is well-structured, with clear formulations and a focus on capturing the stochastic dynamics of underwater channels.
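For readers unfamiliar with latent diffusion surrogates, the sketch below shows one DDPM-style training step in latent space. The encoder, denoiser, noise schedule, conditioning interface, and tensor shapes are placeholders and assumptions, not StableUASim's actual design.

```python
# Minimal sketch of one conditional latent-diffusion training step (epsilon
# prediction); module definitions are stand-ins, not the paper's architecture.
import torch
import torch.nn.functional as F

def latent_diffusion_step(encoder, denoiser, channel_ir, cond, num_steps=1000):
    """channel_ir: batch of measured channel responses; cond: conditioning features."""
    with torch.no_grad():
        z0 = encoder(channel_ir)                  # frozen pretrained encoder (assumed)
    t = torch.randint(0, num_steps, (z0.shape[0],), device=z0.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2   # simple cosine schedule
    alpha_bar = alpha_bar.view(-1, *([1] * (z0.dim() - 1)))
    noise = torch.randn_like(z0)
    zt = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise  # forward diffusion
    pred = denoiser(zt, t, cond)                  # predict the added noise
    return F.mse_loss(pred, noise)

# Toy usage with stand-in modules.
enc = torch.nn.Linear(256, 32)
den = lambda z, t, c: torch.zeros_like(z)
loss = latent_diffusion_step(enc, den, torch.randn(8, 256), cond=None)
```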
The experimental evaluation is comprehensive, utilizing both simulated and real-world datasets to validate the model's performance. The paper compares StableUASim against state-of-the-art methods, demonstrating its superior adaptability and accuracy in reproducing channel characteristics and communication performance. The results are well-presented, with clear metrics such as bit error rate (BER) and cumulative distribution functions (CDFs) that substantiate the claims made regarding the model's effectiveness.
While the paper provides a detailed description of the model architecture and training procedures, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a demo or project URL further limits the ability for other researchers to replicate the findings or build upon the work.
The paper acknowledges limitations in the model's ability to handle variable-length sequences and the reliance on fixed training data configurations. Additionally, the performance may be affected by the quality and diversity of the training datasets, particularly in real-world applications where environmental conditions can vary significantly.
The proposed model has significant implications for underwater communication systems, particularly in enhancing the reliability and efficiency of acoustic channel simulations. Its potential applications extend to various fields, including ocean exploration, environmental monitoring, and autonomous underwater vehicle operations. By improving the modeling of underwater acoustic channels, the research could lead to advancements in communication technologies and methodologies in challenging underwater environments.
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding.
Primary: Zhejiang University
All Institutions: Zhejiang University, Tencent AI Lab Seattle
The main contribution of this paper is the introduction of CaptionStew, a large and diverse audio-text dataset, and a systematic evaluation of audio-language pretraining methods, which collectively advance the understanding of audio representation learning. This work is significant as it addresses key barriers in the field and provides a pathway for future research towards universal audio understanding.
The paper introduces a novel dataset, CaptionStew, which aggregates a large and diverse collection of audio-text pairs, addressing a significant gap in the audio-language pretraining landscape. The methodology is well-structured, comparing contrastive and captioning objectives in a systematic manner, which is a crucial contribution to the understanding of audio representation learning. The evaluation of these objectives across different audio domains (speech, music, environmental sounds) adds depth to the analysis and enhances the robustness of the findings.
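The contrastive objective compared here is the standard CLIP/CLAP-style symmetric InfoNCE loss; the sketch below shows its form for paired audio and text embeddings, with the encoder outputs and temperature value assumed rather than taken from the paper.

```python
# Symmetric InfoNCE (CLIP-style) contrastive loss for audio-text pairs.
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: [B, D] embeddings where row i of each is a pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # [B, B] similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    # Audio-to-text and text-to-audio cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```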
The experiments are comprehensive, utilizing the newly introduced dataset to conduct a thorough evaluation of the proposed methods. The results indicate that audio-language pretraining can yield competitive representations, which is a promising finding for future research. The systematic data-scaling experiments reveal important insights into the strengths of different learning objectives, although the paper could benefit from more extensive quantitative results to support its claims.
The authors have made a commendable effort to ensure reproducibility by providing comprehensive source code and detailed experimental setups. This transparency is crucial for the research community and will facilitate further exploration of the proposed methods.
One limitation of the study is the potential bias in the dataset, as it aggregates open-source audio-text pairs, which may not cover all relevant audio domains or may have varying quality. Additionally, while the findings challenge common initialization practices, the paper does not provide extensive discussion on alternative approaches that could be explored.
The implications of this research are significant, as it lays the groundwork for developing general-purpose audio representations that could enhance various applications, including audio retrieval, classification, and understanding tasks. The release of the dataset and pretrained models will likely accelerate advancements in the field of audio processing and machine learning.
Joint automatic speech recognition (ASR) and speaker diarization aim to answer the question "who spoke what" in multi-speaker scenarios. In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). The model is trained only on short audio under 20s but is capable of streamable inference on long-form audio without additional training. This is achieved by introducing a Speaker Prompt Cache (SPC) with an on-the-fly update mechanism during chunk-wise streaming inference, inspired by the autoregressive nature of LLMs. The SPC also allows the seamless use of pre-enrolled speaker profiles which is common in many scenarios like meeting transcription. To further enhance diarization capability, we incorporate word-level speaker supervision into the speech encoder during training. Experimental results demonstrate that our system outperforms strong baselines, including Sortformer and Meta-Cat in the local setting on audio up to 20s, and DiarizationLM on long-form audio, despite being fully end-to-end and streamable while DiarizationLM follows a cascaded offline pipeline. To the best of our knowledge, this is the first work enabling zero-shot streamable joint ASR and diarization on long audio using a Speech-LLM trained only on short audio, achieving state-of-the-art performance.
Primary: unknown
All Institutions: unknown
The paper presents a pioneering approach to joint ASR and diarization using a Speech-LLM trained on short audio, achieving state-of-the-art performance in long audio scenarios. The innovative methodology and strong experimental results position this work as a meaningful contribution to the field of audio processing and machine learning.
The proposed methodology introduces a novel end-to-end Speech-LLM that leverages a Speaker Prompt Cache (SPC) for streamable inference on long audio. The SPC mechanism is innovative, allowing for real-time updates and the integration of pre-enrolled speaker profiles, which is a significant advancement in ASR and diarization tasks. The incorporation of word-level speaker supervision into the speech encoder is also a noteworthy methodological enhancement that likely improves diarization accuracy.
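Because the Speaker Prompt Cache is described only at a high level here, the following sketch is purely schematic: it assumes a hypothetical model.transcribe_chunk interface and shows how a per-speaker prompt cache could be carried forward and updated across chunks, including seeding with pre-enrolled profiles.

```python
# Schematic chunk-wise streaming loop with a speaker prompt cache; the model
# interface below is a hypothetical placeholder, not a released API.
from typing import Dict, List, Optional, Tuple

def stream_long_audio(model, chunks: List, enrolled: Optional[Dict[str, object]] = None):
    """chunks: successive short audio segments; enrolled: optional pre-enrolled
    speaker-id -> profile-embedding mapping, as in meeting transcription."""
    spc: Dict[str, object] = dict(enrolled or {})   # Speaker Prompt Cache
    transcript: List[Tuple[str, str]] = []          # (speaker_id, text) pairs
    for chunk in chunks:
        # Condition the Speech-LLM on the current cache so speaker labels stay
        # consistent across chunks; the model returns updated speaker profiles.
        segments, updated_profiles = model.transcribe_chunk(chunk, speaker_prompts=spc)
        spc.update(updated_profiles)                # on-the-fly cache update
        transcript.extend(segments)
    return transcript
```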
The experiments are well-structured, demonstrating the model's performance against established baselines like Sortformer, Meta-Cat, and DiarizationLM. The results indicate a strong performance in both short and long audio scenarios, showcasing the model's versatility and effectiveness. However, the paper could benefit from more detailed statistical analysis and comparisons across different datasets to strengthen the claims of superiority.
The paper lacks sufficient detail regarding the implementation specifics, such as the architecture of the Speech-LLM, training procedures, and dataset descriptions. This omission raises concerns about the reproducibility of the results. Providing access to code or a detailed methodology section would significantly enhance reproducibility.
One limitation is the reliance on short audio training data, which may affect the model's generalizability to diverse audio environments. Additionally, the performance metrics could be expanded to include more comprehensive evaluations across various speaker configurations and acoustic conditions.
The implications of this work are significant, particularly in applications such as meeting transcription, customer service, and any domain requiring accurate speaker identification in multi-speaker environments. The ability to perform zero-shot streamable ASR and diarization could lead to more efficient and user-friendly audio processing systems.
Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We propose SceneGuard, a training-time voice protection method that applies scene-consistent audible background noise to speech recordings. Unlike imperceptible perturbations, SceneGuard leverages naturally occurring acoustic scenes (e.g., airport, street, park) to create protective noise that is contextually appropriate and robust to countermeasures. We evaluate SceneGuard on text-to-speech training attacks, demonstrating 5.5% speaker similarity degradation with extremely high statistical significance (p < 10^{-15}, Cohen's d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness evaluation shows that SceneGuard maintains or enhances protection under five common countermeasures including MP3 compression, spectral subtraction, lowpass filtering, and downsampling. Our results suggest that audible, scene-consistent noise provides a more robust alternative to imperceptible perturbations for training-time voice protection. The source code is available at: https://github.com/richael-sang/SceneGuard.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of SceneGuard, a training-time voice protection method that utilizes scene-consistent audible background noise to enhance the robustness of defenses against voice cloning attacks. This innovative approach not only addresses existing vulnerabilities in audio preprocessing but also demonstrates significant potential for real-world applications in safeguarding personal audio data.
The methodology presented in SceneGuard is innovative as it introduces a novel approach to voice protection by utilizing scene-consistent audible background noise. This contrasts with traditional methods that rely on imperceptible adversarial perturbations, which have shown vulnerabilities to common audio preprocessing techniques. The use of contextually appropriate noise enhances the robustness of the protection mechanism. The paper provides a clear description of how the background noise is generated and integrated into the speech recordings, which is a significant contribution to the field of audio security.
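The released repository contains the actual implementation; as a rough sketch of the central operation, the function below mixes a looped scene recording into speech at a chosen signal-to-noise ratio before the speech is exposed for TTS training. The SNR value and the final clipping step are assumptions.

```python
# Minimal sketch of scene-noise mixing at a target SNR (not the authors' code).
import numpy as np

def mix_scene_noise(speech: np.ndarray, scene_noise: np.ndarray,
                    snr_db: float = 10.0) -> np.ndarray:
    noise = np.resize(scene_noise, speech.shape)      # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    return np.clip(mixed, -1.0, 1.0)                  # keep float audio in [-1, 1]
```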
The experimental setup is thorough, evaluating the effectiveness of SceneGuard against various text-to-speech training attacks. The reported results, including a 5.5% degradation in speaker similarity and high speech intelligibility, are statistically significant and demonstrate the method's effectiveness. The robustness tests against common countermeasures, such as MP3 compression and spectral subtraction, further validate the method's practical applicability. However, the paper could benefit from a broader range of datasets to enhance generalizability.
The paper includes a link to the source code, which is essential for reproducibility. However, the details regarding the experimental setup and specific parameters used in the experiments could be more explicitly stated to facilitate replication by other researchers.
One limitation is the potential for the effectiveness of SceneGuard to vary across different languages or dialects, which is not addressed in the paper. Additionally, the reliance on specific acoustic scenes may limit the applicability of the method in less controlled environments.
The implications of this research are significant, as it addresses a critical privacy concern in voice cloning technology. By providing a more robust defense mechanism, SceneGuard could enhance user privacy in various applications, including voice assistants and automated customer service systems. The approach also opens avenues for further research into audio security and privacy protection.
This paper addresses the challenges in short-time Fourier transform (STFT) domain subband adaptive filtering, in particular, subband system identification. Previous studies in this area have primarily focused on setups with subband filtering at a downsampled rate, implemented using the weighted overlap-add (WOLA) filter bank, popular in audio and speech-processing for its reduced complexity. However, this traditional approach imposes constraints on the subband filters when transformed to their full-rate representation. This paper makes three key contributions. First, it introduces a generalized WOLA filter bank that repositions subband filters before the downsampling operation, eliminating the constraints on subband filters inherent in the conventional WOLA filter bank. Second, it investigates the mean square error (MSE) performance of the generalized WOLA filter bank for full-band system identification, establishing analytical ties between the order of subband filters, the full-band system impulse response length, the decimation factor, and the prototype filters. Third, to address the increased computational complexity of the generalized WOLA, the paper proposes a low-complexity implementation termed per-tone weighted overlap-add (PT-WOLA), which maintains computational complexity on par with conventional WOLA. Analytical and empirical evidence demonstrates that the proposed generalized WOLA filter bank significantly enhances the performance of subband system identification.
Primary: KU Leuven
All Institutions: KU Leuven, Nokia Bell Labs
This paper presents a novel generalized WOLA filter bank that enhances subband system identification by repositioning filters, thereby eliminating traditional constraints. The contributions are technically sound and offer meaningful improvements in the field of audio processing, though further work is needed to ensure reproducibility and address real-world applicability.
The paper introduces a generalized WOLA filter bank that innovatively repositions subband filters before the downsampling operation, which is a significant departure from traditional methods. This approach effectively removes constraints on subband filters, enhancing the flexibility and performance of subband adaptive filtering. The methodology is well-structured, with a clear analytical framework that ties together various parameters influencing system identification performance. The introduction of the PT-WOLA implementation to maintain computational efficiency while improving performance is a commendable aspect of the methodology.
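For orientation, here is a schematic of the conventional WOLA-style setup that the paper generalizes: STFT analysis, a per-subband FIR applied along the frame (downsampled) axis, then overlap-add synthesis. Window, hop, and filter shapes are illustrative only, and the generalized and PT-WOLA variants are not reproduced here.

```python
# Schematic conventional STFT-domain subband filtering (the baseline setup).
import numpy as np
from scipy.signal import stft, istft, lfilter

def conventional_subband_filter(x, subband_fir, n_fft=512, hop=256, fs=16000):
    """subband_fir: [n_fft//2 + 1, L] array, one length-L FIR per subband,
    applied at the downsampled (frame) rate."""
    f, t, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)   # X: [K, frames]
    Y = np.empty_like(X)
    for k in range(X.shape[0]):
        # Filtering happens after downsampling, i.e. along the frame axis.
        Y[k] = lfilter(subband_fir[k], [1.0], X[k])
    _, y = istft(Y, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return y
```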
The paper provides a thorough empirical evaluation of the proposed generalized WOLA filter bank, demonstrating its advantages over conventional methods. The experiments are well-designed, focusing on mean square error (MSE) performance metrics that are relevant in the context of system identification. The results convincingly show the improvements achieved through the proposed methods, although the paper could benefit from a broader range of experimental scenarios to further validate the findings.
While the paper outlines the theoretical foundations and provides empirical results, it lacks detailed implementation specifics that would facilitate reproducibility. The absence of code or supplementary materials limits the ability of other researchers to replicate the experiments and verify the results independently.
One limitation of the study is the potential increase in computational complexity associated with the generalized WOLA filter bank, despite the authors' efforts to propose a low-complexity implementation. Additionally, the paper does not address potential challenges in real-world applications, such as varying signal conditions or noise environments, which could affect the performance of the proposed methods.
The advancements presented in this paper have significant implications for audio and speech processing applications, particularly in areas requiring efficient adaptive filtering techniques. The generalized WOLA filter bank could enhance performance in various practical scenarios, such as acoustic echo cancellation and real-time audio processing systems, thereby contributing to the development of more robust audio technologies.
Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences due to the specific loss functions they use. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs including MusicRL's large-scale preference learning, multi-preference alignment frameworks like diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques like Text2midi-InferAlign, we discuss how these techniques can address music's unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and reliability in preference modelling, among others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.
Primary: Singapore University of Technology and Design (SUTD)
All Institutions: Singapore University of Technology and Design (SUTD), AAAI Publications Committee
This paper makes a significant contribution by addressing the alignment of generative music AI with human preferences through innovative methodologies. The exploration of preference alignment techniques and their implications for music generation represents a meaningful advancement in the field, with potential applications that could reshape how music is created and experienced.
The paper presents a systematic approach to aligning generative music AI with human preferences through various innovative techniques, including large-scale preference learning, multi-preference alignment frameworks, and inference-time optimization. The methodologies discussed, such as MusicRL and DiffRhythm+, showcase a blend of reinforcement learning and direct preference optimization, which are well-suited for the complexities of music generation. The authors effectively highlight the unique challenges of musical preference alignment, such as temporal coherence and cultural context, and propose methods that address these challenges. However, the paper could benefit from a more detailed explanation of the implementation specifics for each method.
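Several of the surveyed methods build on direct preference optimization; as a reference point, a generic DPO loss over preferred and dispreferred generations looks like the sketch below. It is the standard formulation and is not specific to MusicRL, DiffRhythm+, or Text2midi-InferAlign.

```python
# Generic DPO loss (Rafailov et al.) over paired preference data.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Inputs: token-summed log-probabilities of the preferred / dispreferred
    generations under the policy and under a frozen reference model, shape [B]."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```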
The paper references experimental results demonstrating the effectiveness of the proposed methods, particularly MusicRL's performance in human evaluations compared to baseline models. However, the paper lacks detailed quantitative metrics and specific datasets used in these experiments, which would strengthen the evaluation of the proposed methods. The mention of various evaluation frameworks, such as SongEval and Audiobox-aesthetic, indicates an awareness of the need for comprehensive assessment metrics, but the results could be more explicitly detailed to allow for better understanding of their significance.
The paper does not provide sufficient implementation details or access to datasets, which raises concerns about reproducibility. The proprietary nature of the MusicRL dataset limits the ability of other researchers to validate the findings independently. Additionally, while the methodologies are described conceptually, specific algorithmic implementations and parameter settings are not detailed, making it challenging for others to replicate the experiments.
The paper identifies several limitations, including the scalability of methods to long-form compositions and the need for more robust preference modeling techniques. The reliance on proprietary datasets also poses a significant barrier to reproducibility and broader adoption of the proposed methods. Furthermore, the paper acknowledges the challenge of capturing the subjective nature of musical appreciation through traditional metrics, which may not fully encompass the complexities of human preference.
The proposed preference-aligned music generation systems have the potential to revolutionize interactive composition tools and personalized music services, enhancing user experiences in various applications, including film scoring, gaming, and therapeutic music generation. The interdisciplinary approach advocated by the authors, combining machine learning with music theory and cognitive science, could lead to more culturally aware and context-sensitive music AI systems, addressing diverse creative needs across different communities.
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. Early studies of AMR trained models solely on synthetic datasets, and evaluation was based on an annotated dataset of fewer than 100 samples, resulting in less reliable reported performance. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, validation, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
Primary: Nagoya University
All Institutions: Nagoya University, LY Corporation
The main contribution of this paper is the introduction of the CASTELLA dataset, a large-scale, human-annotated benchmark for audio moment retrieval that addresses the limitations of previous datasets. This work not only enhances the reliability of AMR evaluations but also sets a foundation for future research in the field, particularly in the development of models that can effectively handle real-world audio data.
The methodology employed in the creation of the CASTELLA dataset is robust, utilizing crowd-sourcing for annotation and ensuring quality through a multi-step review process. The dataset's design addresses significant gaps in existing audio moment retrieval (AMR) benchmarks, particularly by focusing on real-world audio recordings and providing both global and local captions with temporal boundaries. The use of a large-scale dataset (over 1,000 audio recordings) is a notable improvement over previous efforts, which were limited in size and scope.
The experiments conducted using the CASTELLA dataset are well-structured, demonstrating the effectiveness of fine-tuning models pre-trained on synthetic datasets. The results indicate a significant performance improvement when using CASTELLA, showcasing the dataset's utility in advancing AMR tasks. The paper provides clear metrics (Recall1 and mAP) to assess model performance, which is essential for evaluating the impact of the dataset on AMR research.
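Recall1@0.7, as commonly defined for moment retrieval, can be computed as in the sketch below: the top-ranked predicted segment counts as a hit if its temporal IoU with a ground-truth moment is at least 0.7. The handling of multiple ground-truth moments per query is an assumption, not necessarily the authors' exact protocol.

```python
# Sketch of Recall@1 with an IoU threshold for temporal moment retrieval.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall1_at_iou(top1_preds, ground_truths, thresh=0.7):
    """top1_preds: [(start, end)] best prediction per query;
    ground_truths: [[(start, end), ...]] annotated moments per query."""
    hits = sum(any(temporal_iou(p, g) >= thresh for g in gts)
               for p, gts in zip(top1_preds, ground_truths))
    return hits / len(top1_preds)
```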
The paper outlines the experimental setup and model configurations in detail, which aids in reproducibility. However, the lack of a publicly accessible code repository limits the ability for others to replicate the exact experiments. Including a GitHub repository or similar would enhance reproducibility significantly.
One limitation identified is the reliance on crowd-sourced annotations, which, while beneficial for scale, may introduce variability in quality. Additionally, the dataset's focus on English captions may limit its applicability in multilingual contexts. The paper also notes that current models struggle with short audio moments, indicating a potential area for further research and improvement.
The CASTELLA dataset has the potential to significantly impact various applications, including audio indexing, content retrieval in media, and enhancing accessibility for audio content. By providing a comprehensive benchmark for AMR, it opens avenues for improved user experiences in audio search and retrieval systems.
Speech-LLM models have demonstrated great performance in multi-modal and multi-task speech understanding. A typical speech-LLM paradigm is integrating speech modality with a large language model (LLM). While the Whisper encoder was frequently adopted in previous studies for speech input, it shows limitations regarding input format, model scale, and semantic performance. To this end, we propose a lightweight TTA model specialized in speech semantics for more effective LLM integration. With large-scale training of 358k hours of speech data on multilingual speech recognition (ASR), speech translation (ST) and speech-text alignment tasks, TTA is capable of producing robust cross-lingual speech representations. Extensive evaluations across diverse benchmarks, including ASR/ST, speech retrieval, and ASR-LLM performance assessments, demonstrate TTA's superiority over Whisper. Furthermore, we rigorously validate the interplay between cross-lingual capabilities and ASR/ST performance. The model weights and training recipes of TTA will be released as part of an audio understanding toolkit Auden.
Primary: Tencent AI Lab
All Institutions: Tencent AI Lab
The main contribution of this paper is the introduction of the TTA model, which enhances cross-lingual speech representation through a lightweight architecture and extensive training on multilingual datasets. This work addresses significant limitations in existing models and offers a promising avenue for future research in speech understanding and multilingual applications.
The proposed TTA model introduces a lightweight architecture designed specifically for enhancing speech semantics in the context of cross-lingual integration with large language models. The methodology emphasizes large-scale training with 358k hours of multilingual speech data, which is a significant contribution to the field. The authors also address the limitations of the Whisper encoder, providing a clear rationale for their approach. However, the paper could benefit from a more detailed description of the model architecture and training procedures to enhance understanding.
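Because the integration mechanism is not detailed in the review, the sketch below illustrates only a common speech-LLM coupling pattern — encoder features downsampled and projected into the LLM embedding space before being prepended to the text tokens — with hypothetical module names and dimensions; it should not be read as TTA's actual architecture.

```python
# Generic speech-LLM coupling, shown purely to illustrate the paradigm discussed
# above; this is NOT TTA's published design. Names and sizes are made up.
import torch
import torch.nn as nn

class SpeechToLLMProjector(nn.Module):
    """Map frame-level speech-encoder features into the LLM embedding space."""
    def __init__(self, enc_dim=768, llm_dim=4096, downsample=4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):                  # (batch, frames, enc_dim)
        b, t, d = speech_feats.shape
        t = t - t % self.downsample                   # stack frames to shorten the sequence
        stacked = speech_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(stacked)                     # (batch, frames/downsample, llm_dim)

# Usage: the projected speech embeddings are prepended to the LLM's text embeddings.
prefix = SpeechToLLMProjector()(torch.randn(2, 100, 768))
```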
The paper presents extensive evaluations across various benchmarks, including ASR, ST, and speech retrieval tasks, demonstrating the TTA model's superiority over Whisper. The results appear robust, but the paper lacks detailed statistical analysis of the results, which would strengthen the claims of superiority. The choice of benchmarks is appropriate, but additional comparative analyses with other state-of-the-art models could provide a broader context for the performance claims.
The authors mention that they will release model weights and training recipes as part of an audio understanding toolkit, which is a positive step towards reproducibility. However, the paper does not provide enough implementation details or specific hyperparameters used during training, which could hinder other researchers from replicating the results effectively.
One limitation is that the motivation leans heavily on the Whisper encoder's shortcomings without a thorough exploration of its strengths. Additionally, while the model is claimed to be lightweight, the paper does not provide a detailed analysis of its computational efficiency compared to Whisper or other models. The scope of multilingual capabilities also needs further clarification, as the paper does not specify which languages are supported or how performance varies across them.
The TTA model has the potential to significantly impact cross-lingual speech processing applications, enhancing multilingual communication and accessibility. Its integration with large language models could facilitate advancements in various domains, including education, customer service, and global collaboration. The release of the model weights and training recipes could foster further research and development in the field.
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving performance comparable to the state of the art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
Primary: Unknown
All Institutions: Unknown
The paper presents IMSE, an ultra-lightweight speech enhancement network that optimizes existing models by introducing MALA and IDConv, achieving a notable balance between model size and performance. This work contributes to the ongoing efforts to develop efficient deep learning models for speech enhancement, addressing critical challenges in the field.
The paper introduces two significant innovations: Amplitude-Aware Linear Attention (MALA) and Inception Depthwise Convolution (IDConv). MALA addresses the limitations of traditional linear attention by reintroducing amplitude information, which is crucial for maintaining the sharpness of attention distributions. This is a notable improvement over existing methods, as it allows for efficient global modeling without the need for additional compensation branches. IDConv, on the other hand, leverages the Inception architecture to efficiently capture spectrogram features through a multi-branch approach, which reduces parameter redundancy while maintaining performance. The methodology is well-structured and clearly articulated, demonstrating a solid understanding of the challenges in speech enhancement.
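For concreteness, the parallel square, horizontal-strip, and vertical-strip depthwise branches described for IDConv can be sketched as follows; the kernel sizes and the fusion rule below are assumptions rather than the paper's exact configuration.

```python
# Sketch of an Inception-style depthwise block in the spirit of IDConv: parallel
# square, horizontal-strip, and vertical-strip depthwise convolutions whose outputs
# are summed and then mixed by a pointwise convolution. Sizes are illustrative.
import torch
import torch.nn as nn

class InceptionDepthwiseConv(nn.Module):
    def __init__(self, channels, square_k=3, band_k=11):
        super().__init__()
        self.square = nn.Conv2d(channels, channels, square_k,
                                padding=square_k // 2, groups=channels)
        self.horizontal = nn.Conv2d(channels, channels, (1, band_k),
                                    padding=(0, band_k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (band_k, 1),
                                  padding=(band_k // 2, 0), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)   # mix channels after fusion

    def forward(self, x):                                    # x: (B, C, time, freq) spectrogram features
        y = self.square(x) + self.horizontal(x) + self.vertical(x)
        return self.pointwise(y)

out = InceptionDepthwiseConv(32)(torch.randn(1, 32, 100, 64))
```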
The experiments are conducted on the VoiceBank+DEMAND dataset, a well-established benchmark in the field of speech enhancement. The results show that IMSE achieves a competitive PESQ score of 3.373 with a reduced parameter count of 0.427M, outperforming the baseline MUSE model and other lightweight models. The ablation studies provide valuable insights into the effectiveness of the proposed components, confirming the contributions of both MALA and IDConv. However, the paper could benefit from more extensive comparisons with a broader range of state-of-the-art methods to further validate the claims.
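For context, PESQ figures of this kind are typically computed with the reference implementation wrapped by the `pesq` package; the snippet below is a generic illustration with placeholder file paths, not the authors' evaluation script.

```python
# Sketch: wide-band PESQ on a VoiceBank+DEMAND pair using the `pesq` package.
# File paths are placeholders; both signals must be sampled at 16 kHz for 'wb' mode.
import soundfile as sf
from pesq import pesq

clean, sr = sf.read("p232_001_clean.wav")        # reference speech (placeholder path)
enhanced, _ = sf.read("p232_001_enhanced.wav")   # model output (placeholder path)

length = min(len(clean), len(enhanced))          # PESQ expects equal-length signals
score = pesq(sr, clean[:length], enhanced[:length], "wb")
print(f"PESQ (wb): {score:.3f}")
```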
The paper provides a GitHub link to the open-source code, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics, such as the exact training configurations and hyperparameters used, which could hinder replication efforts by other researchers.
While the proposed model shows promising results, it is still limited to the VoiceBank+DEMAND dataset. The generalizability of the model to other datasets or real-world scenarios remains to be tested. Additionally, the paper does not address potential limitations in terms of computational efficiency during inference on actual low-power devices, which is crucial for deployment.
The advancements presented in this paper have significant implications for real-time speech enhancement applications, particularly in resource-constrained environments such as mobile devices and hearing aids. By improving the efficiency of speech enhancement models, this research could enhance communication technologies, making them more accessible and effective in noisy environments.
Generative models have shown remarkable performance in speech enhancement (SE), achieving superior perceptual quality over traditional discriminative approaches. However, existing generative SE approaches often overlook the risk of hallucination under severe noise, leading to incorrect spoken content or inconsistent speaker characteristics, which we term linguistic and acoustic hallucinations, respectively. We argue that linguistic hallucination stems from models' failure to constrain valid phonological structures and it is a more fundamental challenge. While language models (LMs) are well-suited for capturing the underlying speech structure through modeling the distribution of discrete tokens, existing approaches are limited in learning from noise-corrupted representations, which can lead to contaminated priors and hallucinations. To overcome these limitations, we propose the Phonologically Anchored Speech Enhancer (PASE), a generative SE framework that leverages the robust phonological prior embedded in the pre-trained WavLM model to mitigate hallucinations. First, we adapt WavLM into a denoising expert via representation distillation to clean its final-layer features. Guided by the model's intrinsic phonological prior, this process enables robust denoising while minimizing linguistic hallucinations. To further reduce acoustic hallucinations, we train the vocoder with a dual-stream representation: the high-level phonetic representation provides clean linguistic content, while a low-level acoustic representation retains speaker identity and prosody. Experimental results demonstrate that PASE not only surpasses state-of-the-art discriminative models in perceptual quality, but also significantly outperforms prior generative models with substantially lower linguistic and acoustic hallucinations.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of the Phonologically Anchored Speech Enhancer (PASE), which effectively mitigates linguistic and acoustic hallucinations in generative speech enhancement through innovative use of phonological priors and dual-stream representations. This work represents a meaningful advancement in the field of audio processing, addressing critical challenges in speech enhancement with a novel and effective approach.
The proposed PASE framework innovatively leverages the phonological prior from the pre-trained WavLM model, employing a denoising representation distillation strategy that effectively reduces linguistic hallucinations. The dual-stream representation approach for vocoder training is a significant methodological advancement, allowing for the preservation of both linguistic content and speaker characteristics. The paper presents a clear and systematic approach to addressing the challenges of hallucination in generative speech enhancement, which is a notable contribution to the field.
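A minimal sketch of the distillation step, as this reviewer reads it, is given below: a frozen WavLM teacher sees clean speech, a trainable copy sees the noisy mixture, and the student is pulled toward the teacher's final-layer features. The checkpoint and the L1-plus-cosine loss are assumptions, not PASE's published recipe.

```python
# Sketch of denoising representation distillation, assuming a frozen clean-speech
# teacher and a trainable student matched at the final layer. Loss and checkpoint
# choices are illustrative only.
import torch
import torch.nn.functional as F
from transformers import WavLMModel

teacher = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()
student = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")   # trainable denoising expert
for p in teacher.parameters():
    p.requires_grad_(False)

def distillation_loss(noisy_wav, clean_wav):
    with torch.no_grad():
        target = teacher(clean_wav).last_hidden_state        # clean final-layer features
    pred = student(noisy_wav).last_hidden_state              # student denoises in feature space
    return F.l1_loss(pred, target) + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()

loss = distillation_loss(torch.randn(1, 16000), torch.randn(1, 16000))  # 1 s of dummy 16 kHz audio
```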
The experimental design is robust, utilizing a comprehensive dataset and a variety of evaluation metrics that capture both perceptual quality and the integrity of linguistic content. The results demonstrate PASE's superiority over existing models in reducing hallucinations while maintaining high perceptual quality. The extensive ablation studies provide valuable insights into the effectiveness of the proposed methods, reinforcing the claims made by the authors.
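One simple way to quantify linguistic hallucination, for instance, is to transcribe the enhanced audio with an ASR system and score word error rate against the reference transcript; the snippet below illustrates only the scoring step with placeholder transcripts and is not necessarily the paper's protocol.

```python
# Illustrative linguistic-hallucination check: WER between the reference transcript
# and an ASR transcript of the enhanced audio. Transcripts here are placeholders.
import jiwer

reference_text = "the quick brown fox jumps over the lazy dog"
asr_of_enhanced = "the quick brown fox jumps over a lazy dog"   # hypothetical ASR output

wer = jiwer.wer(reference_text, asr_of_enhanced)
print(f"WER: {wer:.3f}")   # higher WER suggests altered spoken content
```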
The paper provides detailed implementation details, including configurations for the DeWavLM and vocoder components, as well as the training process. However, the lack of specific information about the primary institution and the absence of a clear citation for the datasets used may hinder full reproducibility. The provided URLs for the demo and project repository enhance accessibility for further exploration.
While the paper presents a strong methodology and promising results, it does not fully address potential limitations in terms of computational complexity and the scalability of the proposed approach in real-world applications. Additionally, the reliance on a specific pre-trained model (WavLM) may limit the generalizability of the findings to other contexts or datasets.
The implications of this work are significant for applications in speech enhancement, particularly in environments with high levels of noise. By reducing hallucinations, PASE could improve the reliability of speech recognition systems and enhance user experiences in various domains, including telecommunications, assistive technologies, and entertainment.
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding: a controlled question-answering suite that tests whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a substantial gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
Primary: Wuhan University
All Institutions: Wuhan University, Dublin City University, The University of Queensland, University of California, University of Chinese Academy of Sciences
The main contribution of this paper is the introduction of AMPBench, a benchmark for evaluating auditory motion perception in Audio-Language Models, revealing significant deficits in current models' abilities to understand spatial dynamics in sound. This work is a critical step toward improving auditory cognition in machine learning, highlighting the need for further research in this area.
The paper introduces AMPBench, a novel benchmark specifically designed to evaluate auditory motion perception in Audio-Language Models (LALMs). The methodology is well-structured, focusing on controlled question-answering tasks that assess models' abilities to infer sound source motion from binaural audio. The introduction of a benchmark for this specific aspect of auditory perception is a significant contribution, as it addresses a gap in existing evaluation frameworks. However, the paper could benefit from a more detailed description of the benchmark's design and the criteria used for evaluating model performance.
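Because the benchmark's construction is only summarized here, the sketch below merely illustrates the binaural cues — interaural time and level differences — that motion questions implicitly probe; it is not AMPBench's actual pipeline, and the frame sizes are arbitrary.

```python
# Illustrative binaural cues for motion: interaural time difference (ITD) via
# cross-correlation and interaural level difference (ILD) via an RMS ratio, per
# short frame, so their drift over time traces left/right movement.
import numpy as np

def frame_itd_ild(left, right, sr, frame=2048, hop=1024, max_lag=40):
    itds, ilds = [], []
    for start in range(0, len(left) - frame, hop):
        l, r = left[start:start + frame], right[start:start + frame]
        xcorr = np.correlate(l, r, mode="full")[frame - 1 - max_lag: frame + max_lag]
        itds.append((np.argmax(xcorr) - max_lag) / sr)                 # seconds; sign encodes side
        ilds.append(20 * np.log10((np.sqrt(np.mean(l**2)) + 1e-9) /
                                  (np.sqrt(np.mean(r**2)) + 1e-9)))    # dB
    return np.array(itds), np.array(ilds)

# A source moving left-to-right shows ITD/ILD trajectories drifting across zero.
```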
The experiments conducted are comprehensive, involving both quantitative and qualitative analyses of various LALMs. The results indicate a clear performance gap, with models achieving less than 50% accuracy in recognizing motion cues. This finding is critical as it highlights a fundamental limitation in current models. However, the paper lacks a detailed comparison of different models and their specific weaknesses, which could provide deeper insights into the challenges faced in auditory motion perception.
The paper does not provide sufficient details regarding the implementation of the benchmark or the models tested, which raises concerns about reproducibility. Including access to datasets, model architectures, and evaluation scripts would enhance the reproducibility of the findings and allow other researchers to validate the results.
The study acknowledges limitations in the current models' performance but does not explore potential reasons for these deficits in depth. Additionally, the benchmark may not cover all aspects of auditory motion perception, limiting its applicability. The lack of a demo or project URL also restricts further exploration of the proposed methods.
This research has significant implications for the development of more sophisticated audio processing models, particularly in applications such as virtual reality, robotics, and assistive technologies for the hearing impaired. By identifying and addressing the limitations in auditory motion perception, the findings could guide future research directions and model enhancements.