Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
Primary: Meta AI
All Institutions: Meta AI, Northeastern University
WavFlow presents a novel approach to audio generation by directly synthesizing waveforms without latent-space compression. This work significantly advances the field of multimodal audio generation, offering a simpler and more scalable alternative that achieves competitive performance on established benchmarks.
The methodology presented in WavFlow is innovative as it directly generates audio in the waveform space without relying on latent-space compression, which is a significant departure from traditional approaches. The authors introduce waveform patchification and amplitude lifting techniques to manage the high-dimensional nature of raw audio, which enhances the model's ability to learn complex acoustic patterns. The use of conditional flow matching and the x-prediction strategy for stable training are noteworthy advancements that contribute to the robustness of the model. The architecture is designed to facilitate multimodal learning, effectively integrating video and text inputs to enhance audio generation quality.
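To make the waveform-patchify and x-prediction ideas above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the patch length, grid width, the linear noise-to-data path, and every function name are assumptions chosen for illustration. Amplitude lifting is left out because its exact form is not described here.

```python
import numpy as np

def patchify(wave, patch_len, grid_w):
    """Reshape a 1D waveform into a (grid_h, grid_w, patch_len) token grid.

    Hypothetical layout: samples that do not fill a complete grid row are dropped.
    """
    n_tokens = len(wave) // patch_len
    n_tokens -= n_tokens % grid_w  # keep only complete rows of tokens
    if n_tokens == 0:
        raise ValueError("waveform too short for the requested grid")
    trimmed = wave[: n_tokens * patch_len]
    return trimmed.reshape(n_tokens // grid_w, grid_w, patch_len)

def interpolate(x0, x1, t):
    """Linear path from noise x0 (t=0) to data x1 (t=1); an assumed choice of path."""
    return (1.0 - t) * x0 + t * x1

def velocity_from_x_pred(x_t, x1_hat, t, eps=1e-5):
    """Convert a clean-signal prediction x1_hat into a velocity for the linear path.

    For x_t = (1 - t) * x0 + t * x1 the target velocity is x1 - x0,
    which equals (x1 - x_t) / (1 - t).
    """
    return (x1_hat - x_t) / max(1.0 - t, eps)

rng = np.random.default_rng(0)
wave = 0.01 * rng.standard_normal(1000)   # low-energy toy waveform
x1 = patchify(wave, patch_len=16, grid_w=4)
x0 = rng.standard_normal(x1.shape)        # Gaussian noise sample
t = 0.3
x_t = interpolate(x0, x1, t)
v = velocity_from_x_pred(x_t, x1, t)      # a perfect x-prediction recovers x1 - x0
```

With a perfect prediction of the clean signal, the recovered velocity matches the straight-line velocity, which is why a model trained to predict x can still drive a flow-matching sampler under this assumed path.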
The experimental evaluation is thorough, utilizing large-scale datasets and benchmarking against state-of-the-art models in both video-to-audio (VT2A) and text-to-audio (T2A) tasks. The results demonstrate that WavFlow achieves competitive performance, often surpassing existing latent-based methods in various metrics such as Fréchet Distance and Inception Score. The extensive ablation studies provide insights into the impact of different architectural choices and training configurations, reinforcing the validity of the proposed methods.
The paper provides detailed training configurations and experimental setups, which enhance reproducibility. However, the absence of a publicly available demo or project URL limits the ease with which other researchers can replicate the findings. The authors do mention using proprietary datasets, which may also pose challenges for full reproducibility in terms of data access.
The paper acknowledges limitations, particularly in the context of speech or singing synthesis, which are not explicitly addressed by the current model. The authors suggest that extending the framework to include these aspects would require larger datasets and more granular linguistic annotations. Additionally, the reliance on high-quality data curation may limit the model's applicability in scenarios with less controlled data environments.
The implications of this work are significant for the field of audio generation, particularly in applications requiring high-fidelity synthesis and multimodal integration. By demonstrating that high-quality audio can be generated without intermediate latent representations, this research opens avenues for more efficient audio generation frameworks. The potential applications span various domains, including film production, video game sound design, and interactive media, where real-time audio generation aligned with visual content is crucial.
Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
Primary: NTU
All Institutions: NTU, NUS, Shanghai AI Lab
The paper presents Mega-ASR, a unified framework for robust ASR in challenging acoustic environments, significantly advancing the state of the art in the field. The innovative dataset construction, novel training methodologies, and comprehensive evaluation underscore its potential impact on real-world applications of speech recognition technology.
The methodology presented in this paper is robust and innovative, focusing on the construction of a large-scale dataset (Voices-in-the-Wild-2M) that simulates various real-world acoustic phenomena. The authors employ a hierarchical simulation pipeline that effectively combines primitive acoustic effects to create atomic and compound scenarios, addressing the limitations of existing ASR datasets. The use of Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization is particularly noteworthy, as it reflects a sophisticated understanding of the challenges faced in ASR under adverse conditions. The progressive fine-tuning approach is well-justified, and the dual-granularity reward mechanism is a novel contribution that enhances the model's ability to recover semantic information.
The experimental evaluation is comprehensive, with extensive benchmarks against both open-source and closed-source ASR models. The authors provide detailed comparisons across various datasets, demonstrating significant improvements in word error rates (WER) under challenging conditions. The results are well-presented, with clear metrics that highlight the advantages of Mega-ASR over existing models. The inclusion of qualitative case studies further enriches the evaluation by illustrating the types of errors mitigated by the proposed approach.
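Since the reported gains are expressed as word error rates, a short sketch of how WER and relative WER reduction are commonly computed may help when reading the numbers. This is the standard word-level Levenshtein definition, not code from the paper, and the function names are illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, system):
    """Fraction by which `system` lowers the error rate relative to `baseline`."""
    return (baseline - system) / baseline
```

Assuming the figures quoted in the abstract are WERs, 54.01% to 45.69% corresponds to a relative reduction of roughly 15.4%, and 29.34% to 21.49% to roughly 26.8%.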
The paper provides sufficient implementation details, including training setups, hyperparameters, and data sources, which enhances reproducibility. The authors also share their dataset and evaluation benchmarks, facilitating further research in this area. However, the complexity of the model and the specific tuning of parameters may pose challenges for complete replication without access to the original code.
One limitation of the study is the potential overfitting to the simulated conditions, as real-world scenarios can be more diverse and unpredictable. Additionally, while the dataset is extensive, it may not cover all possible acoustic phenomena encountered in various environments. The reliance on simulated data may also limit the model's performance when faced with novel, unseen conditions.
The implications of this research are significant, as it addresses a critical bottleneck in ASR technology, particularly for applications in real-world environments. The advancements in robustness and semantic recovery could enhance the usability of ASR systems in various fields, including telecommunications, accessibility for the hearing impaired, and human-computer interaction. The proposed framework could serve as a foundation for future research in robust ASR and related areas.
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST), University of Seoul
The main contribution of this paper is the introduction of SpeakerLLM, a speaker-specialized audio-LLM framework that effectively integrates speaker understanding and verification reasoning within a natural-language interface. This work significantly advances the field of audio processing by enhancing the explainability and accuracy of speaker verification systems, making it a valuable addition to the literature.
The paper presents a well-structured methodology with a clear two-stage training process for SpeakerLLM, which effectively integrates speaker profiling, recording condition understanding, and verification reasoning. The hierarchical speaker tokenizer is a novel approach that captures different granularities of speaker evidence, enhancing the model's ability to process and understand speaker-specific cues. The decision-composition policy that separates profile-level evidence from the final decision is a significant advancement in explainability for speaker verification systems.
The experiments are comprehensive, demonstrating the effectiveness of SpeakerLLM-Base and SpeakerLLM-VR through various tasks, including speaker profiling and verification reasoning. The results show substantial improvements over general audio-LLMs, especially in tasks requiring fine-grained acoustic evidence. The use of a controlled dataset and clear evaluation metrics strengthens the findings.
The authors commit to releasing the metadata-enriched supervision dataset and target-construction code, which is crucial for reproducibility. However, the paper could benefit from additional details on the implementation of the models and the specific configurations used during training.
The paper acknowledges limitations, including the need for further evaluation of the model in real-world noisy environments and the necessity of consent-aware interfaces for user privacy. The reliance on specific datasets may limit the generalizability of the findings.
The proposed framework has significant implications for the development of audio-first AI systems, particularly in enhancing user interaction through personalized and context-aware speaker verification. The ability to provide explainable decisions in speaker verification can improve trust and usability in applications like conversational agents and security systems.
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
Primary: Tsinghua University
All Institutions: Tsinghua University, Huawei Technologies Co., Ltd
The main contribution of this paper is the introduction of ClariCodec, a neural speech codec optimized for intelligibility at ultra-low bitrates using reinforcement learning. This work represents a significant advancement in codec design, addressing the challenges of speech intelligibility in bandwidth-constrained environments while maintaining competitive acoustic quality.
The proposed methodology introduces ClariCodec, a neural speech codec that operates at an ultra-low bitrate of 300 bps. The two-stage training strategy is innovative, combining a reconstruction-based pre-training phase with a reinforcement learning (RL) fine-tuning phase. The reformulation of quantisation as a stochastic policy allows for direct optimisation against a non-differentiable WER metric, which is a significant advancement in codec training. The use of group relative policy optimisation (GRPO) is a novel approach in this context, enabling the encoder to learn representations that prioritize intelligibility over acoustic fidelity. The methodology is well-structured and clearly articulated, demonstrating a thoughtful integration of RL into codec training.
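The group-relative idea behind GRPO can be sketched in a few lines. The example below is a generic illustration, not ClariCodec's training code: it assumes the reward for each sampled quantisation is the negative WER of the decoded speech, and it normalises rewards within one group using the population standard deviation, which is one common choice among several.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical WERs for four quantisation samples drawn for the same utterance.
wers = [0.05, 0.03, 0.08, 0.04]
rewards = [-w for w in wers]          # assumed reward: lower WER is better
adv = group_relative_advantages(rewards)
```

Samples with below-average WER receive positive advantages and are reinforced, which requires no gradient through the ASR system that produces the WER.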
The experiments are robust, utilizing the LibriSpeech dataset, which is a standard benchmark in speech processing. The results show a clear improvement in WER, with ClariCodec achieving a WER of 3.55% on the test-clean set, which is competitive with higher bitrate codecs. The paper includes comprehensive comparisons with various baseline systems, providing a thorough analysis of both intelligibility and acoustic quality metrics. The ablation studies further validate the effectiveness of the proposed RL fine-tuning strategy, highlighting the balance between intelligibility and perceptual quality.
The paper provides sufficient details regarding the training setup, loss functions, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the absence of a publicly available code repository limits full reproducibility. The authors mention the use of generative AI for language polishing, which does not affect the technical content but is worth noting for transparency.
One limitation is the potential trade-off between intelligibility and acoustic fidelity, as indicated by the PESQ scores during RL fine-tuning. The model's performance may vary under different acoustic conditions or with different languages, which is not explored in the current study. Additionally, the non-causal architecture may pose latency issues for real-time applications, which the authors plan to address in future work.
The development of ClariCodec has significant implications for communication in bandwidth-constrained environments, such as satellite and underwater channels. By prioritizing intelligibility, this work could enhance speech communication in critical applications where clarity is paramount. The approach may also inspire further research into RL applications in audio processing and codec design, potentially leading to advancements in other areas of speech technology.
In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data.
Primary: Southern University of Science and Technology
All Institutions: Southern University of Science and Technology, Carnegie Mellon University
The paper presents a novel framework for cross-talk speech reduction and far-field speech separation, significantly advancing the state of the art in speech processing. The combination of innovative methodologies and strong experimental results positions this work as a valuable contribution to the field of machine learning and audio signal processing.
The paper introduces a novel approach to cross-talk reduction (CTR) using a neural network architecture called CTRnet, which is trained on real-recorded close-talk and far-field mixtures. The methodology is well-structured, leveraging both unsupervised and weakly-supervised learning techniques to address the challenges of domain mismatch in speech separation tasks. The introduction of pseudo-labels for far-field speech separation (PuLSS) is particularly innovative, allowing the use of estimated clean speech from CTRnet as training targets, which enhances the model's performance on real-world data. The formulation of CTR as a blind deconvolution problem is a significant theoretical contribution, providing a solid foundation for the proposed methods.
The experiments are conducted on the challenging CHiME-6 dataset, which is well-suited for evaluating the proposed methods in realistic conversational scenarios. The authors report state-of-the-art results in automatic speech recognition (ASR) performance, demonstrating the effectiveness of their approach. The use of both oracle and estimated speaker diarization in evaluations provides a comprehensive view of the model's capabilities. However, the paper could benefit from additional comparisons with other recent methods to further contextualize its performance.
The paper provides a detailed description of the methodologies and experimental setups, including the architecture of CTRnet and PuLSS, the training processes, and the evaluation metrics used. However, the absence of a publicly accessible code repository limits reproducibility. The authors mention a demo URL, which is a positive aspect, but full implementation details or a project URL would enhance reproducibility significantly.
One limitation is the reliance on speaker-activity timestamps for weak supervision, which may not be available in all practical scenarios. Additionally, while the proposed methods show promise, their performance in more diverse acoustic environments or with varying numbers of speakers could be further explored. The paper also does not address potential computational costs associated with training and inference, which could be a concern for real-time applications.
The proposed methods have significant implications for improving speech separation and recognition in real-world environments, particularly in applications such as teleconferencing, hearing aids, and voice-activated systems. By effectively addressing the cocktail party problem, this research could enhance communication technologies and accessibility tools, making them more robust in noisy environments.
Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparable formats. Audio-to-audio alignment matches audio features; when matching audio files to scores, such methods must either synthesize the score or derive audio-like features by means of piano rolls or similar feature sequences. Symbolic alignment, by contrast, matches symbolically encoded notes; in an audio-to-score scenario these would be obtained by a transcription of the audio file. In this article, we present an algorithm that bridges audio-like and symbol-level features directly. Sequential audio features encoding onset and spectral activation are matched to score positions by a bespoke dynamic programming-based matching algorithm derived from symbolic alignment methods. The resulting method is precise, surpassing widely used audio-to-audio approaches based on synthesized scores, and remains flexible in its digital signal processing components, i.e., the method is adaptable to diverse timbral characteristics without requiring a separate transcription model. Furthermore, it inherits some of the symbolic alignment runtime advantages with an algorithmic complexity that is at worst linear in the length of the (typically short) symbolic score and (typically long) audio feature sequence. In the following sections, we provide a detailed algorithm description and evaluate its alignment quality on a large-scale dataset of solo piano recordings.
Primary: Institute of Computational Perception, Johannes Kepler University Linz
All Institutions: Institute of Computational Perception, Johannes Kepler University Linz, LIT AI Lab, Linz Institute of Technology
The main contribution of this paper is the introduction of a precise audio-to-score alignment algorithm that effectively combines audio feature processing with symbolic alignment techniques, demonstrating significant improvements in alignment accuracy over traditional methods. This work is a meaningful step forward in the field of music information retrieval, showcasing innovative methodologies and practical applications.
The paper presents a novel audio-to-score alignment algorithm that effectively bridges audio-like and symbolic features using a dynamic programming approach. The method is innovative in its direct matching of sequential audio features to score positions without requiring a separate transcription model, which is a significant advancement over traditional methods. The use of a bespoke cost function that incorporates onset, spectral, and stretch terms is well-justified and demonstrates a thoughtful integration of digital signal processing techniques. The algorithm’s linear complexity with respect to the symbolic score length is a notable advantage, making it suitable for real-time applications.
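As a rough illustration of dynamic-programming alignment between audio frames and score positions, the sketch below implements a generic monotonic alignment over a precomputed cost matrix. It is a textbook-style stand-in, not the paper's bespoke algorithm: the onset, spectral, and stretch terms are not modelled, the step rule (stay or advance by one position per frame) is an assumption, and the running time is proportional to the product of the two sequence lengths.

```python
def align_monotonic(cost):
    """Assign each audio frame t to a score position path[t].

    cost[t][j] is the (hypothetical) cost of matching frame t to score position j.
    The path starts at position 0, ends at the last position, never moves
    backwards, and advances by at most one position per frame.
    """
    T, J = len(cost), len(cost[0])
    if T < J:
        raise ValueError("need at least as many frames as score positions")
    INF = float("inf")
    D = [[INF] * J for _ in range(T)]       # accumulated cost
    back = [[0] * J for _ in range(T)]      # predecessor position
    D[0][0] = cost[0][0]
    for t in range(1, T):
        for j in range(J):
            stay = D[t - 1][j]
            move = D[t - 1][j - 1] if j > 0 else INF
            if move < stay:
                D[t][j], back[t][j] = move + cost[t][j], j - 1
            else:
                D[t][j], back[t][j] = stay + cost[t][j], j
    # Backtrack from the final frame at the final score position.
    path = [J - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy cost matrix: 5 frames, 3 score positions, zeros mark the intended match.
toy_cost = [
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
]
toy_path = align_monotonic(toy_cost)
```

In a real system the cost matrix would come from comparing audio features with score-derived expectations; here it is filled in by hand so the recovered path is easy to check.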
The evaluation is conducted on a large-scale dataset of solo piano recordings, providing a robust basis for comparison against existing methods. The results show that the proposed method outperforms a baseline audio-to-audio alignment technique across all precision metrics, which is a strong indicator of its effectiveness. However, the paper could benefit from more detailed analysis of the dataset and potential biases, as well as comparisons with additional state-of-the-art methods beyond the mentioned baselines.
The methodology is described in sufficient detail, and the implementation is made available on GitHub, which facilitates reproducibility. However, the paper lacks specific details on the parameter tuning process and the exact configurations used for the experiments, which could hinder replication efforts by other researchers.
One limitation is the reliance on a specific type of audio (solo piano), which may not generalize well to other musical contexts or instruments. Additionally, while the algorithm shows promising results, it does not achieve the precision of symbolic alignment methods, indicating that there is still room for improvement. The lack of optimization for parameter settings also suggests that the performance metrics could potentially be improved.
This research has significant implications for music information retrieval and could enhance various applications such as music transcription, automatic accompaniment systems, and interactive music learning tools. By providing a more precise and flexible alignment method, it opens avenues for further exploration in the intersection of audio processing and symbolic music representation.
Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
Primary: Renmin University of China
All Institutions: Renmin University of China, WeChat Vision, Tencent Inc.
The main contribution of this paper is the introduction of SEATS, a training-free, stage-adaptive token selection method that significantly enhances the efficiency of omni-modal large language models by intelligently pruning non-textual tokens based on layer-wise dependencies and query relevance. This work represents a meaningful advancement in the field of audio-visual understanding and multimodal AI systems.
The proposed SEATS method introduces a novel three-stage token selection process tailored for omni-modal large language models (om-LLMs). It effectively addresses the inefficiencies associated with processing dense audio-visual tokens by leveraging a block-wise decay schedule for token retention ratios, dynamic allocation of token budgets based on query relevance, and complete removal of non-textual tokens in late layers. This method is training-free and integrates seamlessly into existing om-LLMs, making it both innovative and practical.
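The core operation in this kind of token selection, keeping the highest-scoring fraction of tokens while preserving their original order, can be sketched as follows. This is an illustrative simplification rather than the SEATS implementation: the relevance scores, the per-block retention ratios, and the function names are all hypothetical, and the diversity selection and window-to-modality budget allocation are not modelled.

```python
import math

def select_tokens(scores, keep_ratio):
    """Return indices of the top-scoring tokens, in their original order."""
    k = max(1, math.ceil(keep_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def progressive_prune(scores, block_ratios):
    """Apply a decaying retention schedule, one ratio per block of layers.

    Each ratio is relative to the tokens surviving the previous block.
    A ratio of 0 removes all remaining non-textual tokens.
    """
    kept = list(range(len(scores)))
    history = []
    for ratio in block_ratios:
        if ratio == 0:
            kept = []
        else:
            local = select_tokens([scores[i] for i in kept], ratio)
            kept = [kept[i] for i in local]
        history.append(kept)
    return history

# Hypothetical query-relevance scores for eight audio-visual tokens.
relevance = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]
stages = progressive_prune(relevance, block_ratios=[0.5, 0.5, 0])
```

Keeping the surviving indices sorted preserves the temporal order of the interleaved audio and video tokens, which positional encodings typically rely on.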
The experiments conducted on Qwen2.5-Omni and Qwen3-Omni across five audio-visual benchmarks demonstrate significant improvements in inference efficiency, achieving a 9.3x reduction in FLOPs and a 4.8x speedup in prefill time while maintaining 96.3% of original performance. The results are compelling and validate the effectiveness of the proposed method against competitive baselines.
The paper provides a GitHub repository link for the implementation, which is essential for reproducibility. However, the paper lacks detailed descriptions of the datasets used in the experiments, which could hinder full reproducibility.
One limitation is the reliance on specific om-LLMs (Qwen2.5-Omni and Qwen3-Omni) for validation, which may limit generalizability to other architectures. Additionally, while the method shows promise in efficiency, the impact on model interpretability and the potential loss of nuanced information from the pruning process are not fully addressed.
The proposed method has significant implications for the deployment of om-LLMs in real-time applications, particularly in scenarios requiring efficient processing of audio-visual data, such as video analysis, interactive AI systems, and multimedia content generation. The ability to reduce computational overhead while preserving performance can facilitate broader adoption of these models in resource-constrained environments.
Reinforcement learning is a powerful learning paradigm that has spearheaded progress in numerous domains. Its core promise lies in learning through high-level goals without the need for granular labels. However, it still remains elusive in the realm of audio, where it has received substantially less attention than in computer vision or other domains. The key question remains: how can agents learn to listen purely via reward-driven exploration? In this contribution, we present an overview of previous attempts and a new conceptual framework for learning to listen by reward. Our approach depends on the continuous search for novel sound sources. We formulate our framework, discuss open technical challenges, and present a first proof-of-concept implementation that showcases the feasibility of our approach.
Primary: Technical University of Munich
All Institutions: Technical University of Munich, Imperial College London, Munich Center for Machine Learning, Munich Data Science Institute, Group on Language, Audio, & Music
The paper presents a novel conceptual framework for learning to listen by reward, laying the groundwork for future advancements in reinforcement learning for audio applications. The methodology is innovative, and the initial results are promising, indicating potential for significant contributions to the field of audio machine learning.
The paper proposes a novel conceptual framework for reinforcement learning in audio environments, emphasizing reward-driven exploration for sound source localization. It draws inspiration from human learning, particularly how infants use sound for navigation, and formulates a clear mathematical model for the agent's interactions with its environment. The methodology is well-structured, detailing the design of the reward function and the learning algorithm, which is based on deep Q-learning. However, the paper lacks extensive details on the implementation specifics of the proof-of-concept, which could limit replicability.
The authors conducted experiments in a simulated environment to validate their framework, using two different neural network architectures for the Q-network. The results indicate that the stateful CNN-Transformer model significantly outperforms the memoryless CNN model, achieving 74% accuracy in selecting optimal actions. The experiments are well-designed, focusing on key metrics such as accuracy, reachability, and average total reward, although the evaluation could benefit from more diverse scenarios and a larger dataset to enhance generalizability.
The paper provides a reasonable level of detail regarding the experimental setup, including the simulation environment and the parameters used for training. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Clearer documentation and sharing of the code would enhance the ability of other researchers to replicate the findings.
One notable limitation is the reliance on a simplistic simulation environment, which may not fully capture the complexities of real-world audio interactions. Additionally, the framework currently focuses on stationary sound sources, which restricts the exploration of dynamic audio environments. The authors also acknowledge the limitations of existing simulation software in handling moving sources and microphones, which could impact the realism of their results.
The proposed framework has significant implications for the development of autonomous agents capable of navigating and interacting with their environments using audio cues. This could lead to advancements in fields such as robotics, assistive technologies for the visually impaired, and audio-based navigation systems. The exploration of reinforcement learning in audio contexts is an underrepresented area, and this work could inspire further research and applications in this domain.
Distributed microphone arrays composed of multiple subarrays enable blind source separation over a wide spatial area. Directly applying fast multichannel nonnegative matrix factorization (FastMNMF) to all subarrays can exploit observations from all subarrays, but it requires repeated inversions of large matrices spanning all microphones, causing the computational cost to increase rapidly as the number of microphones grows. In contrast, applying FastMNMF to one subarray reduces the matrix size but cannot exploit observations from other subarrays. We propose distributed FastMNMF, which imposes a block-diagonal structure on the source spatial covariance matrices, so that matrix inversions are performed within subarrays. The NMF-based source spectrogram model is shared across subarrays, allowing the method to aggregate source activity information while discarding inter-subarray covariance. In synchronized, noiseless simulations with fixed room and array/source geometry, the method required less computation time than conventional FastMNMF using all subarrays, achieved a higher average source-to-distortion ratio than conventional FastMNMF using one subarray, and was applicable in the tested five-source condition, where each four-microphone subarray was locally underdetermined.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, The National Institute of Advanced Industrial Science and Technology (AIST)
The main contribution of this paper is the introduction of a distributed FastMNMF method that improves computational efficiency in blind source separation for distributed microphone arrays while maintaining or enhancing separation performance. The technical contribution is significant, addressing a critical challenge in the field of audio signal processing and providing a solid foundation for future research in distributed acoustic sensing and processing.
The proposed distributed FastMNMF method introduces a block-diagonal structure to the source spatial covariance matrices, allowing for more efficient computation in blind source separation using distributed microphone arrays. This approach effectively balances the trade-off between computational efficiency and separation performance by enabling matrix inversions within subarrays rather than across all microphones. The methodology is well-grounded in existing literature, building on FastMNMF while addressing its limitations in distributed settings.
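The computational argument rests on a basic linear-algebra fact: the inverse of a block-diagonal matrix is the block-diagonal matrix of the per-block inverses, so each subarray can be inverted on its own. The sketch below checks this numerically. It is not the distributed FastMNMF update itself; the real-valued random matrices stand in for the complex spatial covariance matrices, and the sizes (three subarrays of four microphones) are chosen only for illustration.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive-definite matrix (stand-in for one subarray's covariance)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

def block_diag(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out

def blockwise_inverse(blocks):
    """Invert each subarray block independently."""
    return [np.linalg.inv(b) for b in blocks]

rng = np.random.default_rng(0)
blocks = [random_spd(4, rng) for _ in range(3)]        # 3 subarrays x 4 microphones
full_inverse = np.linalg.inv(block_diag(blocks))       # one 12x12 inversion
local_inverse = block_diag(blockwise_inverse(blocks))  # three 4x4 inversions
```

With K subarrays of M/K microphones each, dense inversion scales roughly as M^3 while the block-wise route scales as K·(M/K)^3 = M^3/K^2, which is the source of the reduced computation time reported for the distributed variant.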
The experiments conducted are robust, utilizing synchronized, noiseless simulations with a well-defined experimental setup. The authors compare their method against both conventional FastMNMF baselines (using all subarrays and using a single subarray), demonstrating a higher source-to-distortion ratio (SDR) than the single-subarray baseline and lower computation time than the all-subarray baseline. The evaluation metrics are relevant, and the results indicate that the proposed method offers a favorable trade-off in the tested conditions.
The paper provides detailed descriptions of the experimental setup, including the simulation environment, parameter initialization, and the algorithms used. However, the lack of a publicly available code repository limits reproducibility. Future work should include sharing the implementation to facilitate validation of results by the community.
The study is limited to synchronized and noiseless conditions, which may not reflect real-world scenarios where noise and synchronization errors are prevalent. Additionally, the method's performance in more complex acoustic environments or with varying numbers of sources and microphones remains to be explored.
The proposed method has significant potential applications in various fields, including telecommunications, hearing aids, and surveillance systems, where effective sound source separation is crucial. By improving computational efficiency and separation quality, this research can enhance the performance of distributed microphone arrays in practical applications.
Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the challenges of streaming ASR. For example, CTC-based word spotting (CTC-WS) has demonstrated strong performance by directly detecting keywords from CTC log-probabilities, but it is limited to offline processing and requires access to the full utterance. In this work, we present a streaming extension of CTC-WS for real-time contextual biasing. Our method maintains active keyword paths across audio chunks using a stateful token passing algorithm, enabling the detection of keywords that span multiple chunks. To ensure low latency and stable output, we introduce an incremental commitment mechanism that only emits segments guaranteed not to be affected by future audio, while deferring uncertain regions. This method naturally integrates with streaming ASR pipelines and does not require modifications to the underlying acoustic model or additional training, making it practical for real-world deployment. Experimental results show that our method reduces overall WER and effectively improves keyword F-score, demonstrating its effectiveness for real-time ASR applications.
Primary: National Taiwan Normal University
All Institutions: National Taiwan Normal University
This paper presents a novel approach to contextual biasing in streaming ASR, effectively addressing the challenges of recognizing rare and domain-specific words in real-time applications. The methodology is innovative, and the results indicate a meaningful contribution to the field of automatic speech recognition.
The proposed methodology extends CTC-based word spotting to a streaming ASR context, which is a significant advancement given the limitations of existing methods that primarily focus on offline processing. The introduction of a stateful token passing algorithm and an incremental commitment mechanism allows for the detection of keywords that may span across audio chunks, addressing a critical challenge in streaming ASR. The method's design ensures that it integrates seamlessly with existing ASR pipelines without requiring retraining or architectural changes, enhancing its practical applicability.
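One way to picture the incremental commitment idea is as a moving boundary: frames that lie before the earliest still-active keyword path cannot be altered by future audio and can be emitted, while everything from that point on is deferred. The sketch below implements only this bookkeeping under that assumption; the class `IncrementalCommitter`, the function `commit_boundary`, and the representation of active paths by their start frames are hypothetical and are not taken from the paper.

```python
def commit_boundary(chunk_end, active_path_starts):
    """Frames before the returned index cannot be changed by any active keyword path.
    With no active paths, everything up to the end of the current chunk is safe."""
    return min(active_path_starts, default=chunk_end)

class IncrementalCommitter:
    """Tracks how much of the stream has already been emitted (hypothetical helper)."""

    def __init__(self):
        self.committed = 0  # first frame that has not been emitted yet

    def step(self, chunk_end, active_path_starts):
        """Process one chunk and return the (start, end) frame range that is newly safe to emit.
        The range is empty (start == end) when an active path blocks further commitment."""
        boundary = max(self.committed, commit_boundary(chunk_end, active_path_starts))
        newly_committed = (self.committed, boundary)
        self.committed = boundary
        return newly_committed
```

For example, if a keyword candidate starts at frame 17 and is still unresolved when the chunk ending at frame 20 arrives, only frames up to 17 are emitted; the rest is released once the path either completes or is pruned.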
The experimental results demonstrate a clear improvement in both word error rate (WER) and keyword F-score across two datasets specifically designed for named entities. The comparisons with existing methods, such as GPU-accelerated phrase boosting, further validate the effectiveness of the proposed approach. The experiments are well-structured, utilizing appropriate datasets and metrics to assess the performance of the proposed method.
The paper provides sufficient details regarding the datasets used and the experimental setup, including the model architecture and evaluation metrics. However, the lack of publicly available code or a demo limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate further research and validation.
One limitation is the reliance on specific datasets (STOP1 and STOP2) that may not generalize across all ASR applications. Additionally, while the method shows improvements in performance, the computational overhead introduced by the word spotting mechanism may still pose challenges in highly resource-constrained environments.
The advancements presented in this paper have significant implications for real-time applications such as live captioning, voice assistants, and interactive systems where accurate recognition of domain-specific terms is crucial. By improving the recognition of rare and context-specific words, this work could enhance user experience and accessibility in various speech-driven technologies.
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/
Primary: Kim Jaechul Graduate School of AI, KAIST
All Institutions: Kim Jaechul Graduate School of AI, KAIST, Graduate School of Cultural Technology, KAIST
The paper presents CounterFlow, a two-phase inference strategy for counterfactual video Foley generation that significantly improves upon existing methods. The innovative approach, rigorous experimental validation, and potential applications in creative sound design underscore its importance in advancing the field of audio machine learning.
The proposed methodology, CounterFlow, introduces a novel two-phase sampling strategy that effectively separates the temporal structure formation from sound identity injection in counterfactual video Foley generation. This approach is innovative as it addresses the limitations of existing VT2A models, which struggle to generate audio that contradicts the visually implied sound source. The dual-phase design is well-founded, leveraging the understanding that early sampling steps establish timing while later steps refine sound identity. The use of negative prompting to suppress visually implied sources is a clever strategy that enhances control over the generated audio.
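The control flow of such a dual-phase sampler can be sketched with a plain Euler integrator over a flow-matching velocity field. This is an illustration under stated assumptions rather than the authors' sampler: `toy_velocity` is a stand-in for a pretrained VT2A model, the guidance rule is a generic classifier-free-guidance-style combination, the unconditional negative branch in Phase 2 is a guess, and `switch_step` and `scale` are hypothetical parameters.

```python
import numpy as np

def guided_velocity(v_pos, v_neg, scale):
    """Guidance-style combination that pushes toward v_pos and away from v_neg."""
    return v_neg + scale * (v_pos - v_neg)

def toy_velocity(x, t, use_video, prompt):
    """Stand-in for a pretrained flow-matching model (assumption, not a real API)."""
    direction = {"target": 1.0, "visual_source": -1.0, None: 0.0}[prompt]
    return np.full_like(x, direction) + (0.1 if use_video else 0.0)

def two_phase_sample(velocity, x0, num_steps, switch_step, scale):
    """Euler sampling that changes its conditioning at `switch_step`."""
    x = x0.copy()
    dt = 1.0 / num_steps
    phases = []
    for step in range(num_steps):
        t = step * dt
        if step < switch_step:
            # Phase 1: keep video conditioning for timing, and use the
            # visually implied source as the negative prompt.
            v_pos = velocity(x, t, use_video=True, prompt="target")
            v_neg = velocity(x, t, use_video=True, prompt="visual_source")
            phases.append(1)
        else:
            # Phase 2: drop video conditioning and steer only by the target prompt.
            v_pos = velocity(x, t, use_video=False, prompt="target")
            v_neg = velocity(x, t, use_video=False, prompt=None)
            phases.append(2)
        x = x + dt * guided_velocity(v_pos, v_neg, scale)
    return x, phases
```

The choice of `switch_step` corresponds to the transition parameter discussed under limitations: switching too early weakens temporal alignment, while switching too late leaves less room to reshape timbre.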
The experiments are well-structured, utilizing a comprehensive dataset (VGGSound-Sparse Clean) that allows for a robust evaluation of the proposed method against state-of-the-art baselines. The metrics chosen for evaluation, including FLAM and positive-FLAM ratio, are particularly relevant for assessing the quality of counterfactual sound replacement. The results demonstrate a clear advantage of CounterFlow over existing methods, particularly in maintaining temporal alignment while achieving high fidelity in sound identity replacement.
The paper provides sufficient details regarding the experimental setup, including the choice of pretrained models, sampling strategies, and evaluation metrics. The authors state that video demonstrations and code are available through the project page (https://gyubin-lee.github.io/counterflow-demo/), but a direct link to the code repository would make the results easier to reproduce.
One limitation noted in the conclusion is the occasional generation of sound during silent intervals, indicating a need for improved temporal gating. Additionally, while the two-phase approach is effective, it may require careful tuning of transition parameters to balance sound identity and temporal alignment.
The implications of this research are significant for the fields of audio production and machine learning, particularly in creative industries such as film and gaming. By enabling sound designers to generate counterfactual audio that aligns with visual content, this work opens up new avenues for creative expression and enhances the capabilities of automated audio generation systems.
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior to distinguish original recordings from manipulated mixtures, which calibrates the final decisions. Next, two complementary five-class detectors, leveraging SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features integrated via a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group
The paper introduces EnvTriCascade, a tri-stage cascaded framework for audio deepfake detection, achieving a Macro-F1 score of 0.8266 and ranking second in the ESDD2 Challenge. The methodology demonstrates a sophisticated approach to addressing the complexities of mixed audio environments, combining innovative feature extraction and classification strategies that could significantly advance the field of audio deepfake detection.
The paper presents a novel tri-stage cascaded framework (EnvTriCascade) for audio deepfake detection that effectively addresses the challenges of component-level spoofing in mixed audio environments. The methodology is well-structured, incorporating a mix-consistency detector for binary classification, followed by dual-branch multi-class detectors that leverage self-supervised learning representations. The use of a cross-branch attention-gated classifier and RawBoost augmentation enhances the robustness of the system against diverse acoustic conditions. The approach of fusing multiple feature representations and employing a calibration mechanism to mitigate decision conflicts is innovative and demonstrates a solid understanding of the complexities involved in audio deepfake detection.
The experiments are conducted on the CompSpoofV2 dataset, which is substantial in size and complexity, providing a strong basis for evaluating the proposed methodology. The reported Macro-F1 score of 0.8266, which significantly outperforms the baseline, indicates the effectiveness of the proposed framework. The paper includes detailed comparisons of various system configurations, showcasing the contributions of each component to the overall performance. However, the absence of external validation datasets or comparisons with other state-of-the-art methods limits the breadth of the evaluation.
The implementation details are well-documented, including the architecture, training process, and hyperparameters. The use of frozen pre-trained models and specific augmentation strategies is clearly described, which aids in reproducibility. However, the lack of publicly available code or a demo URL limits the ability for others to replicate the results independently.
One limitation of the study is the reliance on a single dataset for training and evaluation, which may affect the generalizability of the results. Additionally, while the proposed system achieves high performance, the complexity of the model with a large number of parameters could pose challenges in real-world applications regarding computational efficiency and deployment.
The proposed framework has significant implications for the field of audio deepfake detection, particularly in enhancing the reliability of audio authenticity verification in various applications such as media, security, and communications. The advancements in component-level spoofing detection could lead to improved systems for combating misinformation and ensuring the integrity of audio content.
Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades significantly when evaluated on mismatched array geometries. In this paper, we propose a geometry-conditioned SSF (GC-SSF), which incorporates a geometry-conditioning branch based on FiLM layers. Furthermore, we propose a feature that jointly encodes the DOA and the microphone positions (DOA-MPE). The conditioning branch modulates the intermediate feature maps of the SSF using the DOA-MPE feature to capture the spatial relationship between the microphone positions and the target speaker. Experimental results across circular, uniform linear, and random microphone arrays show that the proposed GC-SSF generalizes better to mismatched geometries while maintaining high spatial selectivity, demonstrating its ability to effectively adapt the filtering process to different array geometries.
Primary: Carl von Ossietzky Universität Oldenburg
All Institutions: Carl von Ossietzky Universität Oldenburg, German Research Foundation
The main contribution of this paper is the introduction of a geometry-conditioned spatially selective filter (GC-SSF) that enhances target speaker extraction across varying microphone geometries, significantly improving generalization and robustness. This work represents a meaningful step forward in the field of audio processing, addressing a critical challenge with innovative methods and thorough experimental validation.
The proposed methodology introduces a geometry-conditioned spatially selective non-linear filter (GC-SSF) that effectively incorporates a geometry-conditioning branch using FiLM layers and a novel DOA-Microphone Positional Encoding (DOA-MPE) feature. This approach addresses the limitations of existing spatially selective filters by allowing the model to generalize across different microphone geometries, which is a significant advancement in the field of target speaker extraction. The integration of positional encoding and conditioning mechanisms is well-justified and demonstrates a thoughtful approach to enhancing the robustness of the extraction process.
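FiLM conditioning itself is a per-channel affine modulation: a conditioning vector is mapped to a scale (gamma) and a shift (beta) that are applied to every position of a feature map. The sketch below shows that operation in NumPy. The linear maps `w_gamma` and `w_beta`, the feature-map layout (channels, time, frequency), and the placeholder conditioning vector built by `toy_condition` are assumptions for illustration; in particular, `toy_condition` is not the paper's DOA-MPE feature.

```python
import numpy as np

def toy_condition(doa_rad, mic_positions):
    """Placeholder conditioning vector: DOA as (cos, sin) plus flattened mic coordinates.
    This only stands in for a geometry-aware feature; it is not DOA-MPE."""
    mics = np.asarray(mic_positions, dtype=float).ravel()
    return np.concatenate([[np.cos(doa_rad), np.sin(doa_rad)], mics])

def film(features, cond, w_gamma, w_beta):
    """Feature-wise linear modulation.
    features: (channels, time, freq); cond: (d,); w_gamma, w_beta: (channels, d)."""
    gamma = w_gamma @ cond  # per-channel scale
    beta = w_beta @ cond    # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]
```

Because gamma and beta depend on the array geometry through the conditioning vector, the same filter weights can produce different intermediate features for different microphone layouts, which is the mechanism credited for the improved generalization.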
The experimental setup is comprehensive, utilizing various microphone array configurations (circular, uniform linear, and random) to evaluate the performance of the GC-SSF. The results indicate that the proposed method consistently outperforms baseline systems in terms of generalization across mismatched geometries, with clear metrics provided (PESQ and SI-SDR) to support the claims. The sensitivity analysis regarding target DOA errors further strengthens the findings, showcasing the model's robustness and spatial selectivity.
The paper provides sufficient details regarding the experimental setup, including datasets, network architecture, and training procedures, which would allow for reproducibility of the results. However, the absence of a public code repository or demo URL limits the ease of access for other researchers to validate and build upon this work.
One identified limitation is that the current architecture is designed for a fixed number of microphones, which may restrict its applicability in more dynamic or ad-hoc acoustic environments. Additionally, while the results are promising, the paper does not explore the potential impact of varying environmental conditions beyond the simulated scenarios.
The advancements presented in this paper have significant implications for real-world applications in acoustic signal processing, particularly in environments where microphone configurations may vary. The ability to generalize across different geometries could enhance the performance of speaker extraction systems in various settings, such as conference rooms, public spaces, and smart devices.
The conventional normalized subband p-norm (NSPN) algorithm achieves robustness in $α$-stable noise ($1<α\leq 2$) by utilizing low-order error moments. However, its performance degrades significantly under three scenarios: (1) non-Gaussian inputs, (2) $α$-stable noise with $0<α\leq 1$, and (3) sparse system identification. To address these limitations, this paper proposes a fractional-order NSPN algorithm based on the nearest Kronecker product (NKP) decomposition and fractional-order stochastic gradient descent, termed NKP-FoNSPN. Theoretical bounds for the fractional-order parameter $β$ are also derived. Notably, when $β=1$, the NKP-FoNSPN reduces to a new NKP-NSPN algorithm, while its non-NKP decomposition variant becomes the fractional-order NSPN (FoNSPN) algorithm. Furthermore, a novel transformation-based NKP (TNKP) decomposition technique is designed, which exhibits lower computational complexity than conventional NKP for specific filter structures. The resulting TNKP-based FoNSPN (TNKP-FoNSPN) achieves lower steady-state misadjustment and multiplication cost compared with the NKP-FoNSPN algorithm. Additionally, complete computational complexity analyses are provided. For active noise control (ANC) scenarios, we develop filtered-x variants: NKP-FxFoNSPN and TNKP-FxFoNSPN. From the former, two additional variants are derived: NKP-FxNSPN and FxFoNSPN. Simulations using diverse noise sources (pink, helicopter, gunshot, pile driver, and traction substation noise) demonstrate the superiority of the proposed algorithms. Finally, we validate their noise reduction performance in a physically constructed single-channel duct ANC system and a simulated multi-channel ANC system.
Primary: Southwest Jiaotong University
All Institutions: Southwest Jiaotong University, Ministry of Education, School of Electrical Engineering, Key Laboratory of Magnetic Suspension Technology and Maglev Vehicle
The main contribution of this paper is the development of a fractional-order subband p-norm adaptive filter that effectively addresses the limitations of existing algorithms in active noise control scenarios. This work significantly advances the state-of-the-art in adaptive filtering by introducing innovative methodologies and demonstrating their effectiveness through rigorous experimentation.
The paper introduces a novel fractional-order normalized subband p-norm adaptive filter (NKP-FoNSPN) that leverages the nearest Kronecker product decomposition and fractional-order stochastic gradient descent. The methodology is well-structured, addressing specific limitations of existing algorithms in handling non-Gaussian inputs and sparse system identification. The introduction of the transformation-based nearest Kronecker product decomposition (TNKP) technique is particularly noteworthy, as it reduces computational complexity while enhancing performance. The theoretical bounds for the fractional-order parameter are derived, which adds rigor to the proposed approach. The paper effectively combines theoretical insights with practical algorithm development, making it a significant contribution to adaptive filtering in noise control scenarios.
The experimental evaluation is comprehensive, utilizing various noise sources (pink, helicopter, gunshot, pile driver, and traction substation noise) to validate the proposed algorithms. The simulations demonstrate the superiority of the NKP-FoNSPN and TNKP-FoNSPN algorithms over existing methods, particularly in challenging noise environments. The performance metrics used, such as normalized mean-square deviation (NMSD), are appropriate for the context, and the results are well-presented, showing clear advantages in convergence rates and steady-state misadjustment.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or links to datasets used in the experiments. While the methodology is described in detail, the absence of a demo or project URL limits the ability of other researchers to replicate the findings directly.
One limitation is the lack of real-world applicability testing beyond the constructed single-channel duct ANC and simulated multi-channel ANC systems. The performance in diverse and uncontrolled environments remains to be validated. Additionally, while the paper addresses several scenarios, it may not cover all potential edge cases in adaptive filtering, particularly with more complex noise profiles.
The proposed algorithms have significant implications for active noise control applications, particularly in environments where non-Gaussian noise is prevalent. The advancements in adaptive filtering techniques can enhance various fields, including audio processing, telecommunications, and environmental noise management. The integration of fractional-order calculus into adaptive filtering may inspire further research into novel approaches for handling complex signal processing challenges.
Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30-40% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, The Hong Kong Polytechnic University, Independent Researcher
The main contribution of this paper is the introduction of Grouped Sequential Training (GST), a novel framework that effectively addresses the challenges posed by dataset heterogeneity in training Audio Large Language Models. This work is significant as it combines theoretical insights with practical implementations, demonstrating substantial improvements in training efficiency and model performance across diverse audio tasks.
The paper introduces Grouped Sequential Training (GST), a novel approach that organizes datasets into affinity-aware groups to mitigate the challenges of dataset heterogeneity in training Audio Large Language Models (ALLMs). The methodology is well-structured, combining theoretical analysis with practical implementation, and effectively balances the trade-offs between parallel and sequential training. The use of gradient-based affinity metrics to cluster datasets is particularly innovative, as it allows for a more principled approach to dataset scheduling without incurring significant computational overhead. The theoretical framework established for convergence analysis is robust and provides a solid foundation for the proposed method.
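One way to picture gradient-based affinity grouping is to compare per-dataset gradient vectors with cosine similarity and place datasets whose gradients point in similar directions into the same group. The sketch below is only an illustration under that assumption: the cosine metric, the greedy grouping rule, the `threshold` value, and the dataset names are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def group_by_affinity(grads, threshold=0.5):
    """Greedy grouping (an assumed rule, not the paper's algorithm).

    A dataset joins the first existing group in which every member's
    gradient has cosine similarity >= threshold with its own gradient;
    otherwise it starts a new group.
    """
    groups = []
    for name, g in grads.items():
        for group in groups:
            if all(cosine(g, grads[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical two-dimensional "gradients" for three datasets.
example_grads = {
    "speech_a": [1.0, 0.0],
    "speech_b": [0.9, 0.1],
    "music": [-1.0, 0.2],
}
example_groups = group_by_affinity(example_grads)
```

In this toy input the two speech datasets have nearly parallel gradients and end up together, while the music dataset, whose gradient conflicts with theirs, forms its own group.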
The experiments are extensive, involving 14 diverse AudioQA datasets that cover a wide range of audio tasks, including speech, music, and environmental sounds. The results demonstrate that GST achieves 30-40% faster convergence compared to standard parallel training while maintaining or improving performance. The comparison against various baselines, including naive mixing and sequential training, is thorough, and the metrics used (token-level accuracy) are appropriate for evaluating the performance of ALLMs. The empirical results validate the theoretical claims made in the paper, showcasing the effectiveness of the proposed approach.
The paper provides detailed implementation details, including model architecture, training configurations, and hyperparameters, which enhances reproducibility. However, the absence of publicly available code or datasets limits the ability for others to replicate the results directly. Future work could benefit from sharing the experimental setup to facilitate further research in this area.
The paper acknowledges limitations, such as the focus on a specific model architecture (SALMONN) and the need for further verification of GST's scalability to larger models. Additionally, the static ordering of dataset groups may not account for dynamic changes in dataset relationships during training, suggesting that a more adaptive approach could be explored.
The proposed GST framework has significant implications for the training of ALLMs, particularly in applications requiring robust audio understanding across diverse datasets. By addressing the inefficiencies associated with dataset heterogeneity, this work could lead to more efficient training protocols in various audio-related tasks, enhancing the capabilities of AI systems in real-world applications.
Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.
Primary: Southeast University
All Institutions: Southeast University, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Purple Mountain Laboratories, Engineering Research Center of Blockchain Application
The main contribution of this paper is the introduction of MusicDET, a zero-shot AI-generated music detection framework that utilizes frequency-guided normalizing flows to model real music distributions, achieving state-of-the-art performance in cross-generator evaluations. This work represents a significant advancement in the field of audio detection, addressing a critical need for reliable methods to distinguish between human-created and AI-generated music.
The methodology presented in MusicDET is innovative, leveraging frequency-guided normalizing flows to model the distribution of real music features for zero-shot detection of AI-generated music. This approach is particularly noteworthy as it circumvents the need for training on generated samples, which is a significant limitation in existing methods. The use of a probabilistic framework allows for effective detection of out-of-distribution samples, and the detailed design of frequency-wise decomposition and band-wise normalizing flows demonstrates a deep understanding of the complexities of musical data.
The experimental evaluation is thorough, utilizing two benchmark datasets (FakeMusicCaps and SONICS) to validate the effectiveness of MusicDET. The results indicate that it consistently outperforms conventional discriminative detectors, particularly in cross-generator scenarios. The use of Equal Error Rate (EER) as a metric is appropriate for the task, and the paper provides a comprehensive analysis of the results, including comparisons with state-of-the-art methods. The experiments also include ablation studies that enhance the understanding of the model's performance and robustness.
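The Equal Error Rate is the operating point at which the false-acceptance rate and the false-rejection rate coincide. The sketch below approximates it by sweeping thresholds over the observed scores; it assumes higher scores mean "more likely real" and that label 1 marks real music, both of which are conventions chosen here for illustration rather than details confirmed by the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by a threshold sweep.

    scores: detector outputs, higher = more likely real (assumed convention).
    labels: 1 for real samples, 0 for generated samples.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    best_gap, eer = None, None
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in neg) / len(neg)  # generated accepted as real
        frr = sum(s < t for s in pos) / len(pos)   # real rejected as generated
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A detector that separates the two classes perfectly yields an EER of 0, while one that ranks the classes no better than chance sits near 0.5.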
The paper provides sufficient implementation details, including the architecture of the model, training procedures, and evaluation metrics. The authors mention the use of a GitHub repository for code access, which supports reproducibility. However, the reliance on specific hardware (NVIDIA RTX 4090) may limit accessibility for some researchers.
One limitation is the potential for MusicDET to struggle with robustness against audio manipulations, as indicated in the experiments. Additionally, while the zero-shot approach is a significant advancement, it may not cover all practical scenarios, especially as generative models evolve. The paper also acknowledges the need for further research into robustness against adversarial attacks and real-world post-processing.
The work has significant implications for the music industry, particularly in protecting artistic integrity and addressing copyright issues associated with AI-generated content. By providing a reliable detection method, MusicDET could help mitigate the risks of misuse of generative music technologies, fostering a more equitable music ecosystem. The research also opens avenues for future work in audio authenticity and anomaly detection across various domains.
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
Primary: Wuhan University
All Institutions: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
The paper presents a significant advancement in speaker-specific deepfake detection through the innovative use of phoneme-level profiling, offering a robust and interpretable framework that outperforms existing methods.
The proposed Phoneme-based Voice Profiling (PVP) framework introduces a novel approach to deepfake detection by focusing on phoneme-level analysis rather than macro-utterance assessments. This shift allows for capturing speaker-specific articulatory patterns through lightweight Gaussian Mixture Models (GMMs), enhancing interpretability and robustness against unseen spoofing attacks. The methodology is well-structured, combining phoneme-level consistency scoring with global speaker identity modeling, which is a significant advancement over traditional black-box models.
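The idea of per-phoneme profiling can be sketched with a deliberately simplified stand-in: one diagonal Gaussian per phoneme (that is, a one-component mixture) fitted only on bona fide reference frames, with a test utterance scored by its average per-frame log-likelihood. This is not the authors' implementation; the feature vectors, the `var_floor` value, and all function names are assumptions, and a real system would fit multi-component GMMs on acoustic features aligned to phonemes.

```python
import math
from collections import defaultdict


def fit_profiles(frames, var_floor=1e-3):
    """Fit one diagonal Gaussian per phoneme from (phoneme, feature_vector) pairs."""
    by_phoneme = defaultdict(list)
    for ph, x in frames:
        by_phoneme[ph].append(x)
    profiles = {}
    for ph, xs in by_phoneme.items():
        n, d = len(xs), len(xs[0])
        mean = [sum(x[i] for x in xs) / n for i in range(d)]
        # Floor the variance so that sparse phonemes do not produce zero variance.
        var = [max(sum((x[i] - mean[i]) ** 2 for x in xs) / n, var_floor)
               for i in range(d)]
        profiles[ph] = (mean, var)
    return profiles


def log_likelihood(x, mean, var):
    """Log density of x under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def utterance_score(frames, profiles):
    """Average per-frame log-likelihood over phonemes that have a profile."""
    scores = [log_likelihood(x, *profiles[ph]) for ph, x in frames if ph in profiles]
    if not scores:
        raise ValueError("no frame matches a profiled phoneme")
    return sum(scores) / len(scores)


# Hypothetical 2-D features for two phonemes of one speaker.
reference = [("a", [1.0, 2.0]), ("a", [1.2, 2.1]), ("a", [0.8, 1.9]),
             ("i", [5.0, 0.0]), ("i", [5.2, 0.1])]
profiles = fit_profiles(reference)
genuine = [("a", [1.0, 2.0]), ("i", [5.1, 0.05])]
spoof = [("a", [3.0, 0.0]), ("i", [2.0, 2.0])]
```

Because the score decomposes into per-phoneme terms, the phonemes with the lowest likelihood can be inspected individually, which is the kind of phoneme-level interpretability the review highlights.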
The experimental evaluation is robust, utilizing both a newly created Chinese POI deepfake dataset and the Famous Figures dataset to benchmark the proposed method. The results demonstrate substantial improvements in detection performance, with significant reductions in Equal Error Rate (EER) compared to state-of-the-art methods. The ablation studies further validate the importance of each methodological component, reinforcing the effectiveness of phoneme-level profiling.
The paper provides sufficient implementation details, including model configurations and evaluation metrics, which enhances reproducibility. The availability of the code and dataset on GitHub supports further research and validation of the findings.
While the framework shows promising results, it may still be limited by the diversity of the training data, particularly in capturing all possible phonetic variations across different speakers and languages. Additionally, the reliance on a small amount of reference speech data may not generalize well to all scenarios.
The implications of this research are significant, particularly in the realms of security and forensic analysis, where accurate detection of deepfake audio can prevent misinformation and protect public figures. The interpretability of the model also opens avenues for its application in legal contexts, where understanding the rationale behind detection decisions is crucial.
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096× temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
Primary: Stability AI
All Institutions: Stability AI
The paper introduces SAME, a stereo music and general-audio autoencoder that achieves a remarkable 4096× temporal compression ratio while maintaining sound quality and generative performance. The comprehensive analysis of the technical contributions, innovative methodology, and significant implications for the field highlight the paper's relevance and potential impact in advancing audio generative models.
The paper presents a novel architecture for audio autoencoding, termed SAME (Semantically-Aligned Music autoEncoder), which integrates a transformer-based backbone with semantic regularization, phase-aware reconstruction losses, and improved discriminator designs. The methodology is well-structured, employing a combination of innovative techniques such as query-based transformer resampling and a soft-normalization bottleneck, which collectively enhance the model's generative capabilities while achieving a high compression ratio. The use of auxiliary losses to shape the latent space for downstream tasks is particularly noteworthy, as it demonstrates a thoughtful approach to improving generative performance without relying on traditional VAE formulations.
The evaluation is thorough, employing both objective metrics (such as SI-SDR and MEL log-magnitude) and subjective assessments via MUSHRA tests. The results indicate that SAME-L outperforms several baselines in terms of audio quality and computational efficiency, which is a significant achievement given the high compression ratio. The inclusion of ablation studies further strengthens the findings by isolating the contributions of various components of the model.
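SI-SDR, one of the objective metrics named above, projects the estimate onto the reference so that a global gain change does not affect the score. The sketch below follows the widely used definition for single-channel signals given as plain lists; the function name `si_sdr_db` is an assumption, and any zero-mean normalization or stereo handling used in the paper is not reproduced here.

```python
import math


def si_sdr_db(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    target = (<estimate, reference> / ||reference||^2) * reference
    SI-SDR = 10*log10(||target||^2 / ||estimate - target||^2)
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference must be non-zero")
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    if noise_energy == 0:
        return float("inf")  # estimate is an exact scaled copy of the reference
    if target_energy == 0:
        return float("-inf")  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / noise_energy)
```

Doubling the amplitude of the estimate leaves the value unchanged, which is why the metric suits codecs whose output level may differ from the input.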
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which should facilitate reproducibility. However, the lack of a publicly available code repository or demo limits the ability for independent verification of results. The authors mention releasing model weights, which is a positive step but does not fully address the reproducibility of the entire system.
One limitation is the reliance on a specific dataset (Audiosparx production music) for training, which may affect the generalizability of the model to other audio domains. Additionally, while the model achieves impressive results, the computational demands of the larger variant (SAME-L) may limit its accessibility for broader applications, particularly in resource-constrained environments.
The advancements presented in this paper have the potential to significantly impact the field of audio processing and generative models. By achieving high compression ratios while maintaining audio quality, the model could facilitate more efficient audio streaming and storage solutions. Furthermore, the integration of semantic alignment in audio generation opens avenues for more contextually aware audio applications, such as music generation that aligns with specific themes or emotions.
The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has made strides in recent years, sonata form analysis remains in its early stages. This is largely because annotating classical music structures is time-consuming and requires substantial musical expertise. To advance research in this area, we curated SoSA-Moz, the first large-scale dataset featuring comprehensive hierarchical structure annotations. This work establishes a foundation for systematic sonata form analysis. Leveraging this newly contributed resource, we further propose Sonalyzer-Moz, a baseline model specifically designed for investigating complex sonata structures. This framework integrates feature aggregation with sequential modeling, enabling it to capture both local features and upper-level structural dependencies. Experimental results show that Sonalyzer-Moz is capable of identifying the boundaries of the upper-level structural components that are critical to understanding sonata form. Therefore, this method demonstrates, for the first time, the effectiveness of automatic upper-level analysis of sonata form, and provides a robust baseline for future research in the automatic understanding of sonata form while advancing the study of classical music structure analysis.
Primary: Monash University Malaysia
All Institutions: Monash University Malaysia, La Trobe University, Monash University
The main contribution of this paper is the introduction of the SoSA-Moz dataset and the Sonalyzer-Moz framework, which together provide a novel approach to analyzing the complex hierarchical structure of Mozart's sonata form using deep learning techniques. This work not only fills a gap in the literature but also sets a foundation for future research in automatic music structure analysis.
The methodology presented in the paper is well-structured, introducing the SoSA-Moz dataset as a foundational resource for sonata form analysis. The Sonalyzer-Moz framework employs a combination of feature aggregation and sequential modeling through CNN and LSTM layers, which is appropriate for capturing the hierarchical nature of sonata form. The integration of dynamic self-similarity matrices and statistical features enhances the model's ability to identify structural boundaries. However, the paper could benefit from a more detailed explanation of the hyperparameter tuning process and the rationale behind the chosen configurations.
The experimental evaluation is robust, with a clear division of the dataset into training, validation, and test sets to prevent data leakage. The paper provides a comprehensive comparison against state-of-the-art methods for popular music, which demonstrates the effectiveness of Sonalyzer-Moz. The reported performance metrics (HR3R, HR3P, HR3F) are relevant and provide insight into the model's capabilities. However, the paper lacks a detailed discussion on the significance of the performance metrics and how they relate to the specific challenges of sonata form analysis.
The implementation details are adequately described, including the use of specific hardware and software configurations. The availability of the dataset and code as open-source contributes positively to reproducibility. However, the paper could enhance reproducibility by providing more explicit instructions for setting up the environment and running the experiments.
One notable limitation is the reliance on a single composer (Mozart) for the dataset, which may limit the generalizability of the findings to other composers or styles of classical music. Additionally, the model's performance, while competitive, still leaves room for improvement, particularly in capturing the nuances of the sonata form. The authors acknowledge that the current model may not fully exploit the potential of deep learning architectures for this specific domain.
The work has the potential to significantly advance the field of music structure analysis, particularly in classical music. By providing a large-scale dataset and a baseline model, it opens avenues for further research and development of more sophisticated models that could enhance music education, music generation, and music recommendation systems. The focus on sonata form analysis may also inspire interdisciplinary collaborations between musicology and machine learning.
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoiding the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2s on an H200 GPU and less than a few seconds on a MacBook Pro M4. We release the weights of small and medium, which can run on consumer-grade hardware, together with their training and inference pipeline.
Primary: Stability AI
All Institutions: Stability AI
Stable Audio 3 represents a significant advancement in audio generation technology, combining innovative methodologies with practical applications for both consumers and professionals. The paper's contributions to variable-length audio generation, inpainting capabilities, and efficient model training are poised to impact the field of machine learning and audio synthesis significantly.
The methodology presented in Stable Audio 3 is robust and innovative, particularly in its approach to variable-length audio generation using latent diffusion models. The introduction of a semantic-acoustic autoencoder, which allows for efficient audio representation while preserving fidelity and semantic structure, is a significant advancement in audio generation. The use of adversarial post-training to enhance inference speed and output quality is also noteworthy, as it addresses a critical challenge in generative models. The paper effectively combines multiple techniques, including flow matching, distillation, and adversarial training, to create a comprehensive training pipeline that enhances both the quality and efficiency of audio generation.
The paper provides a thorough evaluation of the models against existing state-of-the-art systems, demonstrating significant improvements in audio generation quality and inference speed. The experiments are well-structured, showcasing the models' capabilities in generating variable-length audio and performing inpainting tasks. However, specific quantitative metrics and user studies could further substantiate the claims of superior performance, particularly in subjective evaluations of audio quality.
The authors have made the model weights for the small and medium versions available, along with the training and inference pipeline, which is a positive step towards reproducibility. However, the paper could benefit from more detailed implementation instructions and hyperparameter settings to facilitate easier replication of results by other researchers.
One limitation of the study is the reliance on licensed and Creative Commons data, which may restrict the diversity of the audio used for training. Additionally, while the models are designed for consumer-grade hardware, the computational requirements for the larger models may still be prohibitive for some users. The paper also does not address potential biases in the training data that could affect the generated outputs.
The implications of Stable Audio 3 are significant for various applications, including music production, sound design, and interactive media. By enabling high-quality audio generation on consumer hardware, the model democratizes access to advanced audio synthesis tools, potentially fostering creativity and innovation in the audio domain. The ability to perform targeted audio editing through inpainting could also enhance workflows in professional audio production.
Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, an LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7) with inter-rater reliability confirmed by a 0.68 Cohen's Kappa, validating our curation pipeline's 97.6% confidence score. The corpus maintains a 60-40 gender balance across 71,792 utterances. Our work represents a significant leap toward linguistic inclusivity in global AI. The corpus and code are open-sourced, and a demo page is available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The main contribution of this paper is the introduction of the UrduSpeech corpus, a comprehensive resource for Urdu speech technology that includes high-fidelity audio and extensive paralinguistic annotations. This work significantly enhances the landscape of speech resources for under-resourced languages, addressing critical gaps in existing datasets and methodologies.
The methodology is robust, leveraging a multi-stage pipeline for data curation that addresses the unique challenges of Urdu speech, including RTL script and code-switching. The authors employed advanced techniques such as speaker diarization and noise removal, ensuring high-quality audio segments. The integration of 12-dimensional paralinguistic annotations is a significant enhancement, allowing for detailed analysis of emotional and vocal characteristics. The use of generative models for transcription and annotation, along with a rigorous human-centric validation framework, further strengthens the methodology.
The experiments are comprehensive, with a clear focus on establishing a baseline for Urdu ASR and TTS. The authors conducted a pilot study and a thorough evaluation of various transcription models, providing detailed comparisons and insights into their performance. The Mean Opinion Score (MOS) and inter-rater reliability metrics demonstrate the corpus's high fidelity and reliability, which are crucial for future research and applications.
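Cohen's kappa, the inter-rater statistic cited for this corpus, corrects observed agreement between two raters for the agreement expected by chance. The sketch below shows the standard unweighted two-rater form; treating the ratings as plain categorical labels is an assumption made for illustration, and the paper's exact protocol (for example, weighted kappa on ordinal MOS ratings) is not known from this summary.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

A value of 0 means agreement no better than chance and 1 means perfect agreement, which places the reported 0.68 between the two.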
The paper outlines the data collection and preprocessing steps in detail, which aids in reproducibility. However, the absence of a publicly accessible project URL limits the ability for others to directly replicate the study. The authors mention open-sourcing the corpus and code, which is a positive aspect for reproducibility.
The paper acknowledges several limitations, including potential over-segmentation in speaker diarization and the presence of background noise in some audio segments. While the authors have made efforts to validate the gender distribution and speaker IDs, ongoing work is needed to ensure absolute compliance. The reliance on automated systems for initial processing may introduce errors that require manual correction.
The UrduSpeech corpus represents a significant advancement in the field of speech technology for under-resourced languages, particularly Urdu. By providing a high-quality, diverse dataset, this work has the potential to enhance the performance of ASR and TTS systems for Urdu and related dialects, fostering linguistic inclusivity in AI applications. The integration of paralinguistic metadata opens new avenues for research in affective computing and speaker profiling.
Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
Primary: Institute of Radiology, University Hospital Erlangen
All Institutions: Institute of Radiology, University Hospital Erlangen, Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Institut für Informationsverarbeitung, Leibniz Universität Hannover, Department of Radiology, Harvard Medical School and Massachusetts General Hospital
The paper introduces SIREM, a novel speech-informed MRI reconstruction framework that leverages synchronized audio to enhance real-time imaging of speech production. This work represents a meaningful advancement in multimodal imaging techniques, combining audio and MRI data to improve reconstruction quality and efficiency, thereby addressing critical challenges in the field of speech science and clinical assessment.
The proposed SIREM framework innovatively combines synchronized audio with MRI reconstruction to address the challenges of real-time magnetic resonance imaging of speech. By modeling the reconstruction as a fusion of audio-driven and MRI-driven components, the methodology effectively leverages the correlation between vocal tract configurations and produced acoustics. The introduction of a learnable soft weighting profile over spiral arms adds a differentiable mechanism for optimizing k-space sampling, which is a significant advancement in the field. However, the reliance on a fixed segmentation-derived explained-by-audio map limits the flexibility of the model.
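The fusion described above can be sketched in a few lines. This is a minimal illustration, assuming the spatial weighting map acts as a pixel-wise convex blend between the audio-driven and MRI-driven components, and that the soft arm profile is a sigmoid over per-arm logits. Neither form is confirmed by the paper, and `fuse_frame` and `soft_arm_weights` are hypothetical names.

```python
import math


def fuse_frame(audio_pred, mri_recon, weight_map):
    """Blend two images pixel by pixel (assumed convex combination).

    weight_map[i][j] in [0, 1] is the share of pixel (i, j) taken from
    the audio-driven prediction; the remainder comes from the MRI branch.
    """
    fused = []
    for a_row, m_row, w_row in zip(audio_pred, mri_recon, weight_map):
        fused.append([w * a + (1.0 - w) * m
                      for a, m, w in zip(a_row, m_row, w_row)])
    return fused


def soft_arm_weights(arm_logits):
    """Map unconstrained per-arm logits to soft weights in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in arm_logits]


# Toy 1x2 frame: the first pixel is fully explained by audio,
# the second pixel takes only a quarter of its value from audio.
audio = [[1.0, 1.0]]
mri = [[0.0, 0.0]]
weights = [[1.0, 0.25]]
fused = fuse_frame(audio, mri, weights)
```

A fixed, segmentation-derived map would correspond to passing a constant `weight_map` here, whereas a learned fusion map would make it a model output.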
The experiments are well-structured, utilizing the USC speech rtMRI benchmark and comparing SIREM against established baselines such as gridding, wavelet-based compressed sensing, and total variation. The evaluation metrics are comprehensive, covering both distortion-based and perceptual measures, which provide a thorough assessment of reconstruction quality. While SIREM does not uniformly outperform classical methods, it demonstrates the utility of synchronized speech as a prior, achieving notable improvements in certain metrics.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which enhances reproducibility. The availability of the source code on GitHub further supports this aspect, allowing other researchers to replicate the study and build upon the proposed method.
Key limitations include the use of a fixed explained-by-audio map, which may not capture the full variability of the audio signal's predictive power across different anatomical regions. Additionally, the evaluation is based on a relatively small dataset, which may affect the generalizability of the results. Future work should explore learned fusion maps and prospective sampling strategies to enhance the model's adaptability.
The SIREM framework has significant potential applications in speech science and clinical assessment, particularly in improving the efficiency and quality of rtMRI for speech production analysis. By reducing scan times and enhancing reconstruction fidelity, this method could facilitate more effective clinical evaluations and research into speech disorders. The integration of multimodal data also opens avenues for further exploration in related fields such as audio-visual speech synthesis and real-time imaging technologies.
Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue becomes harder as real and generated audio are increasingly mixed in training, and generated samples do not always match their intended semantic labels. Prior work mainly addressed unreliable supervision from missing-positive labels, while this paper targets three other sources of unreliable supervision: spurious additions, misassignments between similar classes, and weakened label evidence. These effects introduce class-dependent optimisation bias that is not explicitly modeled by most existing methods. To bridge this gap, the paper proposes a Class-wise Supervision Unreliability (CSU) framework that controls supervision strength at the class level during training. CSU learns a separate unreliability parameter for each class and down-weights less reliable supervision without changing the model architecture or inference process. To support evaluations, this paper also introduces ESC-FreeGen50, a manually verified benchmark of 50 sound classes that combines real and generated audio. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across different architectures and different sources of supervision unreliability. The results indicate that explicit class-wise modeling of supervision unreliability is an effective and practical strategy for robust audio tagging under large-scale weakly labeled training. Code and data are available at: https://github.com/Yuanbo2020/CSU
Primary: University of Oxford
All Institutions: University of Oxford, KU Leuven, Harbin Engineering University, KTH Royal Institute of Technology, University of Surrey
The main contribution of this paper is the introduction of the Class-wise Supervision Unreliability (CSU) framework, which effectively addresses the challenges posed by unreliable supervision in audio tagging tasks. The comprehensive evaluation and the introduction of a new benchmark dataset significantly advance the state of the art in robust audio tagging methodologies.
The paper introduces the Class-wise Supervision Unreliability (CSU) framework, which innovatively addresses the problem of unreliable supervision in audio tagging by learning separate unreliability parameters for each class. This approach allows for dynamic down-weighting of less reliable supervision without altering the model architecture or inference process. The methodology is well-structured, addressing three specific types of supervision unreliability (spurious additions, misassignments, and weakened label evidence) and providing a clear rationale for the need for class-wise control mechanisms. The incorporation of a new benchmark dataset, ESC-FreeGen50, further enhances the methodology by allowing controlled evaluations of the proposed framework.
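To make the idea of class-level down-weighting concrete, here is a minimal sketch. It assumes each class `c` carries a scalar unreliability parameter `u_c` and that its binary cross-entropy term is scaled by `1 - sigmoid(u_c)`. This weighting form and the function names are illustrative assumptions, not the formulation from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one class, with clipping for stability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def classwise_weighted_bce(probs, labels, unreliability):
    """Average per-class BCE, scaled by an assumed reliability weight.

    A larger unreliability value u_c gives a smaller weight
    1 - sigmoid(u_c), so that class contributes less to the loss.
    """
    total = 0.0
    for p, y, u in zip(probs, labels, unreliability):
        total += (1.0 - sigmoid(u)) * bce(p, y)
    return total / len(probs)


# The same wrong prediction costs less when the class is marked unreliable.
trusted = classwise_weighted_bce([0.1], [1], [-4.0])
doubted = classwise_weighted_bce([0.1], [1], [4.0])
```

If `u_c` were learned freely under this loss alone, it would drift toward ignoring every class, so a practical scheme needs some regulariser or constraint on the parameters. How CSU handles that is not stated in this summary.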
The experiments are comprehensive, utilizing both the newly introduced ESC-FreeGen50 dataset and the well-established AudioSet for validation. The results demonstrate that CSU significantly improves robustness across various architectures and types of supervision unreliability. The evaluation metrics used, including mean Average Precision (mAP) and F1-score, are appropriate for the audio tagging task. The paper effectively shows the performance gains of CSU over baseline models and other robust learning methods, providing strong empirical support for the proposed framework.
The paper provides sufficient details regarding the implementation of the CSU framework and the experimental setup, including model architectures, training procedures, and evaluation metrics. However, the lack of a demo URL or direct access to the experimental results may hinder full reproducibility for external researchers. Nonetheless, the availability of the code and dataset on GitHub is a positive aspect for reproducibility.
While the paper presents a robust framework, it does not thoroughly explore the potential limitations of the CSU approach, such as the impact of varying the number of classes or the generalizability of the learned unreliability parameters across different datasets. Additionally, the reliance on manually verified labels for the ESC-FreeGen50 dataset may limit its scalability and applicability to larger, less curated datasets.
The proposed CSU framework has significant implications for the field of audio tagging and weakly supervised learning, particularly in real-world applications where annotation quality is often inconsistent. By improving robustness against unreliable supervision, this work can enhance the performance of audio tagging systems in various domains, including environmental sound recognition and multimedia content analysis. The introduction of the ESC-FreeGen50 dataset also provides a valuable resource for future research in this area.
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music. Our experiments using the MAESTRO dataset showed that our method attained state-of-the-art performance in onset detection. We confirmed the versatility of the OT loss in application to existing models.
Primary: Graduate School of Informatics, Kyoto University
All Institutions: Graduate School of Informatics, Kyoto University, Graduate School of Engineering, Kyoto University, Independent Researcher, Hong Kong
This paper presents a significant advancement in automatic piano transcription by introducing an optimal transport framework that enhances the model's ability to handle temporal misalignments. The combination of a novel loss function and a well-designed neural architecture positions this work as a meaningful contribution to the field of machine learning in music.
The paper introduces a novel approach to automatic piano transcription (APT) by framing it as an optimal transport (OT) problem rather than a traditional frame-level multi-label classification task. This shift is significant as it allows for more flexible handling of temporal misalignments in note predictions. The proposed convolutional recurrent neural network (CRNN) architecture, SFT-CRNN, incorporates a harmonics-aware attention mechanism, enhancing its ability to model spectro-temporal dependencies. The methodology is well-structured, with a clear explanation of the OT loss function and its application to the APT task, making it accessible for replication and further exploration.
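The following sketch illustrates why a transport cost tolerates small timing errors. It is a deliberate simplification: it handles a single pitch along the time axis only and uses the closed-form 1D Wasserstein-1 distance (the sum of absolute differences between cumulative distributions), whereas the paper transports mass over both time and frequency. The function name is hypothetical.

```python
def wasserstein_1d(pred, target):
    """Wasserstein-1 distance between two activations on a frame grid.

    Both sequences are normalised to sum to 1. The cost is measured in
    frames: moving all mass by one frame costs exactly 1.0.
    """
    if len(pred) != len(target):
        raise ValueError("sequences must have the same number of frames")
    sum_p, sum_t = sum(pred), sum(target)
    if sum_p <= 0 or sum_t <= 0:
        raise ValueError("each sequence needs positive total mass")
    cost = 0.0
    cdf_gap = 0.0  # running difference between the two cumulative sums
    for p, t in zip(pred, target):
        cdf_gap += p / sum_p - t / sum_t
        cost += abs(cdf_gap)
    return cost


truth = [0.0, 0.0, 1.0, 0.0]       # onset at frame 2
near_miss = [0.0, 1.0, 0.0, 0.0]   # predicted one frame early
far_miss = [1.0, 0.0, 0.0, 0.0]    # predicted two frames early

cost_near = wasserstein_1d(near_miss, truth)
cost_far = wasserstein_1d(far_miss, truth)
```

A frame-wise binary loss would score `near_miss` and `far_miss` identically, since both miss the target frame and fire on a wrong one. The transport cost grows with the size of the timing error instead.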
The experiments are robust, utilizing the MAESTRO dataset, which is a well-regarded benchmark in the field. The authors report state-of-the-art results in onset detection, achieving an F1-score of 98.36%, which demonstrates the effectiveness of their approach. The comparative study against established baselines and the ablation studies provide strong evidence for the contributions of the OT loss and the model architecture. The evaluation metrics used (precision, recall, F1-score) are appropriate for the task, and the results are presented clearly.
While the paper provides a detailed description of the model architecture and training procedures, it lacks specific implementation details such as code availability or links to a repository. This omission may hinder reproducibility, as other researchers would need to rely solely on the descriptions provided to replicate the results.
One limitation noted in the paper is the model's performance on offset detection, which does not exceed the best-performing systems. The authors attribute this to the absence of a dedicated sustain pedal detection module, indicating a potential area for future work. Additionally, the reliance on a specific dataset (MAESTRO) may limit the generalizability of the results to other musical contexts.
The proposed method has the potential to significantly impact the field of music information retrieval, particularly in applications requiring accurate transcription of musical performances. The ability to handle temporal misalignments could improve the usability of APT systems in real-world scenarios, such as music education, automated accompaniment, and music analysis tools. Furthermore, the model's adaptability to other tasks within music information retrieval suggests broader applicability.
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Each item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean, or Chinese, and evaluates models across (1) single, (2) source separation-based two-stage, and (3) end-to-end cocktail party settings under controlled SNRs. Evaluating two closed-source and four open-weight LALMs, we find that strong single performance does not ensure robust selective auditory attention: cocktail party accuracy degrades under severe SNRs, and errors are dominated by distractor-grounded source confusion. In addition, separation reduces acoustic overlap but leaves source attribution unresolved, often yielding confident wrong-stream answers. Data and code will be released upon publication.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
The main contribution of this paper is the introduction of the MUSA benchmark, which evaluates the selective auditory attention of LALMs in multilingual contexts, revealing critical insights into their performance limitations. This work significantly advances the understanding of LALMs' capabilities and highlights the importance of robust auditory attention mechanisms in real-world applications.
The paper introduces a novel benchmark, MUSA, designed to evaluate the selective auditory attention capabilities of Large Audio Language Models (LALMs) in the presence of multilingual distractors. The methodology is well-structured, employing a cocktail party paradigm that mimics real-world scenarios where multiple languages may interfere with audio processing. The authors rigorously define the experimental settings, including single, separation-based, and cocktail party conditions, and provide a detailed diagnostic error taxonomy that categorizes model failures. This structured approach allows for a comprehensive understanding of the models' performance under varying signal-to-noise ratios (SNRs), which is a significant advancement in the field.
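As an illustration of what mixing "under controlled SNRs" typically involves, the sketch below scales a distractor so that the target-to-distractor power ratio hits a requested value in dB. It uses the standard power-ratio definition of SNR; the exact mixing procedure used to build MUSA is not described in this summary, and the function names are hypothetical.

```python
import math


def mean_power(signal):
    return sum(x * x for x in signal) / len(signal)


def distractor_gain(target, distractor, snr_db):
    """Gain for the distractor so that 10*log10(P_target / P_scaled) == snr_db."""
    p_distractor = mean_power(distractor)
    if p_distractor == 0:
        raise ValueError("distractor is silent; SNR is undefined")
    return math.sqrt(mean_power(target) / (p_distractor * 10.0 ** (snr_db / 10.0)))


def mix_at_snr(target, distractor, snr_db):
    """Return target plus the rescaled distractor, sample by sample."""
    gain = distractor_gain(target, distractor, snr_db)
    return [t + gain * d for t, d in zip(target, distractor)]


target = [1, -1, 1, -1]        # mean power 1
distractor = [2, 2, -2, -2]    # mean power 4
mixture = mix_at_snr(target, distractor, 0.0)  # equal power after scaling
```

Lower or negative `snr_db` values make the distractor louder relative to the target, which corresponds to the severe conditions where the review reports accuracy degrading.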
The experiments are robust, involving six different LALMs evaluated across multiple settings and SNR levels. The results clearly demonstrate that high performance in single-stream conditions does not translate to effective performance in cocktail party scenarios, highlighting a critical gap in current LALM capabilities. The authors provide detailed statistical analyses and error distributions, which enhance the validity of their findings. The use of synthesized audio ensures consistency, although it may limit ecological validity.
The paper mentions that data and code will be released upon publication, which is a positive aspect for reproducibility. However, the specifics of the implementation details, such as the exact configurations of the models and the separation techniques used, could be better documented to facilitate independent replication of the results.
The study is limited by the relatively small dataset of 200 synthesized cases, which may not capture the full variability of natural speech. Additionally, the focus on English as the target language and the use of a single off-the-shelf separator may restrict the generalizability of the findings. The authors also acknowledge potential confounding factors that could influence cross-lingual understanding, which are not fully controlled.
The findings have significant implications for the deployment of LALMs in high-stakes environments such as healthcare and aviation, where accurate audio understanding is critical. By addressing the challenges of multilingual interference, this research paves the way for more reliable audio processing systems that can operate effectively in diverse linguistic contexts. The introduction of MUSA as a benchmark could stimulate further research in this area, leading to advancements in model architectures and training methodologies.
High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coarse, track-level annotations. To address these limitations under constrained data and computing resources, we propose S2Accompanist, a Semantic-Aware and Structure-Guided Diffusion Model developed for the ICME2026 ATTM Grand Challenge. Specifically, we design an automated data pipeline comprising structural segmentation, Large Audio-Language Model driven segment-level captioning, and dual-metric quality grading to overcome the absence of localized metadata in raw datasets. Furthermore, we propose a semantic-aware Variational Autoencoder fine-tuning strategy that explicitly distills foundational LeadSheet structures into the acoustic latent space, effectively improving the overall audio fidelity. Extensive experiments demonstrate that S2Accompanist achieves state-of-the-art objective performance on the ATTM Grand Challenge benchmark across both the Efficiency and Performance Tracks. With only 402M parameters, our model remains competitive compared to larger-scale unconstrained models and secured first place in the Efficiency Track.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, WeNet Open Source Community
S2Accompanist presents a significant advancement in music accompaniment generation through its innovative data pipeline and semantic-aware modeling techniques. The comprehensive evaluation against benchmarks demonstrates its effectiveness and potential impact on the field of machine learning in audio.
The methodology presented in S2Accompanist is robust and innovative, particularly in its automated data pipeline that integrates structural segmentation and semantic captioning. The use of a Large Audio-Language Model (LALM) for generating fine-grained captions is a significant advancement, allowing for better semantic control in music generation. The introduction of a semantic-aware Variational Autoencoder (VAE) fine-tuning strategy is a notable contribution, as it effectively distills musical structures into the latent space, enhancing audio fidelity. The overall architecture, which combines these elements into a diffusion model, is well-structured and addresses the limitations of existing models in generating coherent musical accompaniments.
The experimental evaluation is thorough, with extensive testing against established benchmarks in the ATTM Grand Challenge. The results demonstrate S2Accompanist's superiority in both objective metrics (FAD, CCS) and subjective evaluations (MOS), securing the top position in the Efficiency Track. The use of dual-metric grading for data selection is particularly effective, ensuring high-quality training data. The paper provides clear comparisons with other models, showcasing the competitive performance of S2Accompanist despite its smaller parameter size.
The paper includes sufficient details regarding the training process, model architecture, and evaluation metrics, which aids in reproducibility. However, the lack of a publicly available code repository or demo limits the ability for others to replicate the results directly. Future work could benefit from releasing the model and data pipeline to the community.
One limitation is the reliance on the MTG-Jamendo dataset, which may not encompass the full diversity of musical styles and genres, potentially affecting the generalizability of the model. Additionally, while the model performs well under constrained conditions, its performance in less controlled environments or with more complex musical tasks remains to be tested.
The advancements made in S2Accompanist have significant implications for the field of music generation, particularly in enabling high-fidelity accompaniment generation with limited data resources. This could democratize access to music generation technologies, allowing smaller developers and researchers to create sophisticated models without the need for extensive datasets or computational power. The model's approach to integrating semantic understanding into music generation could also inspire future research in multimodal AI applications.
Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impressions through letter shapes, strokes, layouts, and decorative patterns. However, cross-modal retrieval between onomatopoeic images and general sounds has been largely unexplored. This paper thus introduces a bidirectional retrieval framework between onomatopoeic images and the corresponding sound clips. Instead of directly comparing embeddings extracted from pretrained image and audio encoders, we train modality-specific projection heads that re-align the embeddings for visual onomatopoeia and corresponding sounds. We then construct the Multimodal Image-Audio Onomatopoeia dataset (MIAO), which contains paired onomatopoeic images and sound clips across 50 sound event classes. Experimental results show that the proposed method substantially outperforms a zero-shot baseline using pretrained CLIP and CLAP embeddings. These results demonstrate that adapting pretrained representations enables effective retrieval in both directions: from onomatopoeic images to sounds and from sounds to onomatopoeic images.
Primary: Kyoto University
All Institutions: Kyoto University, Doshisha University
This paper presents a novel approach to cross-modal retrieval between onomatopoeic images and sounds, significantly contributing to the field of audio-visual machine learning. The methodology effectively adapts existing models to a unique dataset, demonstrating the potential for improved retrieval performance in multimedia applications.
The proposed methodology introduces a novel bidirectional retrieval framework that leverages modality-specific projection heads to align embeddings from pretrained image and audio encoders for onomatopoeic images and sounds. This approach effectively addresses the challenge of cross-modal retrieval in a previously unexplored area, demonstrating a thoughtful adaptation of existing models (CLIP and CLAP) to a unique dataset (MIAO). The use of projection heads to refine the embedding space is a significant methodological advancement that enhances retrieval performance.
The experiments conducted are robust, utilizing a well-constructed dataset (MIAO) with a clear evaluation strategy. The results show substantial improvements over a zero-shot baseline, indicating the effectiveness of the proposed method. The evaluation metrics used (mAP, R@k, MRR) are appropriate for the task and provide a comprehensive view of the retrieval performance in both directions. However, the paper could benefit from additional comparisons with more sophisticated baselines or alternative methods to further validate the effectiveness of the proposed approach.
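For readers unfamiliar with the retrieval metrics named above, the sketch below computes R@k and MRR from a query-by-candidate similarity matrix. It assumes a simplified one-to-one setting in which query `i` has exactly one correct candidate, at index `i`; the MIAO evaluation is class-based and may treat several candidates as correct, which is the case mAP is designed for. All names are illustrative.

```python
def rank_of_match(similarities, correct_index):
    """1-based rank of the correct candidate (ties resolved in its favour)."""
    correct_score = similarities[correct_index]
    return 1 + sum(1 for s in similarities if s > correct_score)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct candidate is in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


# Rows are image queries, columns are audio candidates.
similarity = [
    [0.9, 0.1, 0.2],  # correct candidate (index 0) ranked 1st
    [0.8, 0.3, 0.5],  # correct candidate (index 1) ranked 3rd
    [0.1, 0.2, 0.7],  # correct candidate (index 2) ranked 1st
]
ranks = [rank_of_match(row, i) for i, row in enumerate(similarity)]
r_at_1 = recall_at_k(ranks, 1)
mrr = mean_reciprocal_rank(ranks)
```

Evaluating the sound-to-image direction amounts to running the same functions on the transposed matrix, which is why the two directions can yield different scores.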
The paper provides sufficient detail regarding the methodology, including the dataset construction, experimental setup, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ability for others to directly replicate the results. Providing implementation details or a link to a codebase would significantly enhance reproducibility.
One identified limitation is the reliance on pretrained models, which may not fully capture the nuances of onomatopoeic images. Additionally, the model's performance varies between retrieval directions, suggesting that further investigation into the variability of visual representations is needed. The dataset's size and diversity may also limit generalizability, as the results are based on a specific set of illustrators and sound classes.
The proposed framework has potential applications in multimedia production, particularly in enhancing the efficiency of sound effect selection in visual media like comics and animations. By automating the retrieval process based on visual cues, this work could significantly reduce the manual effort required by creators, leading to more streamlined workflows in multimedia content creation. Furthermore, the insights gained from this research could inspire future studies on cross-modal retrieval and representation learning in other domains.
Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech representations. This mismatch causes TTS models to focus excessively on low-level acoustic textures at the expense of high-level semantic coherence, further exacerbating error accumulation in autoregressive generation. To address this challenge, we propose SemaVoice, a semantic-aware continuous autoregressive framework for high-fidelity zero-shot TTS. SemaVoice introduces a Speech Foundation Model (SFM) guided alignment mechanism that refines continuous speech representations to better capture both local semantic consistency and global structural relationships. These representations condition a patch-wise diffusion head within the autoregressive framework for high-quality speech synthesis. Experimental results on the Seed-TTS benchmark show that SemaVoice achieves an English WER of 1.71% and remains highly competitive with state-of-the-art open-source systems in both objective and subjective evaluations. The effectiveness of SFM guided alignment is further confirmed by significant improvements under varying representation granularities with a fixed information-rate constraint.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Tsinghua University, SenseTime Research
The main contribution of this paper is the introduction of SemaVoice, a semantic-aware continuous autoregressive framework that significantly improves high-fidelity zero-shot text-to-speech synthesis through an innovative SFM-guided alignment mechanism. This work represents a meaningful advancement in the field of speech synthesis, addressing critical limitations in existing models and demonstrating strong experimental results.
The proposed SemaVoice framework introduces a novel SFM-guided alignment mechanism that effectively addresses the mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech representations. This innovative approach enhances the semantic coherence of generated speech while maintaining acoustic fidelity. The use of a continuous autoregressive framework, combined with a patch-wise diffusion head, is a significant advancement over traditional TTS architectures. The methodology is well-structured, with a clear explanation of the components and their interactions, although the complexity may pose challenges for replication.
The experimental evaluation is robust, utilizing a large-scale bilingual dataset of 150K hours for training and thorough testing on the Seed-TTS benchmark. The results demonstrate competitive performance against state-of-the-art systems in both objective (WER, speaker similarity) and subjective (MOS) metrics, indicating the effectiveness of the proposed framework. The ablation studies provide valuable insights into the contributions of key components, reinforcing the importance of the SFM-guided alignment mechanism.
While the paper provides a detailed description of the architecture and training process, it lacks a publicly available implementation or code repository, which hinders reproducibility. The absence of a demo URL also limits practical engagement with the model.
The evaluation is limited to a bilingual dataset, which may restrict the generalizability of the findings. Additionally, the paper acknowledges inherent challenges with sequential inference latency and error accumulation in autoregressive generation, which could impact real-time applications.
The advancements in zero-shot TTS synthesis have significant implications for applications in voice cloning, virtual assistants, and accessibility technologies. By improving the semantic coherence and acoustic fidelity of synthesized speech, SemaVoice could enhance user experience in various domains, including entertainment, education, and communication.
Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-identifiable attributes that may dominate model predictions, which may compromise both diagnosis performance and patient privacy. Furthermore, the acoustic features associated with respiratory disease and speaker identity remain unclear in respiratory disease monitoring. We propose an adversarial learning architecture that disentangles pathology-related acoustic patterns from speaker-identifiable attributes. The framework optimizes two clinically hierarchical tasks: (i) respiratory status classification (stable vs. exacerbated) and (ii) exacerbation type classification (asthma exacerbation vs. COPD exacerbation). Speaker identity is suppressed through gradient reversal-based adversarial training. To enhance clinical interpretability, we employ SHapley Additive exPlanations (SHAP) to quantify the contributions of acoustic features to pathology-related predictions versus speaker identity. On the TACTICAS dataset, our method outperforms the single-task baseline across both tasks. For the respiratory status task (stable vs. exacerbated), the AUC improves from 0.897 to 0.910. For the exacerbation type task (asthma exacerbation vs. COPD exacerbation), the AUC increases from 0.674 to 0.793. Concurrently, the J-ratio decreases, confirming effective suppression of speaker information. SHAP analysis reveals the contributions of the acoustic features to both tasks. External validation on the Bridge2AI-Voice dataset further demonstrates consistent performance improvement and reduced speaker dependency, confirming cross-dataset generalizability.
Primary: Maastricht University
All Institutions: Maastricht University, Maastricht University Medical Centre, NUTRIM Research Institute of Nutrition and Translational Research in Metabolism
The main contribution of this paper is the development of a multi-task adversarial learning framework that enhances the accuracy of speech-based monitoring for asthma and COPD exacerbations while preserving patient privacy. This work represents a significant step forward in the intersection of machine learning, healthcare, and privacy, providing a foundation for future research and applications in remote health monitoring.
The proposed methodology utilizes an innovative adversarial learning framework that effectively disentangles speaker-identifiable attributes from pathology-related acoustic features. This approach is well-justified, addressing critical issues of privacy and model generalizability in speech-based monitoring of respiratory diseases. The use of gradient reversal for adversarial training is a solid choice, and the integration of SHAP for interpretability adds significant value to the methodology. However, the paper could benefit from a more detailed explanation of the hyperparameter tuning process and the rationale behind the choice of specific features.
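The gradient reversal idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `GradReverse`, the helper `encoder_gradient`, and the scalar gradients are hypothetical stand-ins used only to show the sign flip that pushes a shared encoder away from speaker-predictive features.

```python
class GradReverse:
    """Identity in the forward pass; multiplies the gradient by -lam going backward."""

    def __init__(self, lam):
        self.lam = lam  # strength of the adversarial signal (assumed hyperparameter)

    def forward(self, x):
        return x

    def backward(self, grad_out):
        return -self.lam * grad_out


def encoder_gradient(grad_task, grad_speaker, lam):
    """Gradient reaching a shared encoder feature from the two heads.

    grad_task: gradient of the pathology loss w.r.t. the feature.
    grad_speaker: gradient of the speaker-ID loss w.r.t. the feature,
    which passes through the reversal layer before reaching the encoder.
    """
    grl = GradReverse(lam)
    return grad_task + grl.backward(grad_speaker)


# The encoder descends the task loss while ascending the speaker loss.
g = encoder_gradient(grad_task=0.5, grad_speaker=0.2, lam=1.0)
```

In an autograd framework the same behaviour is usually obtained with a custom backward function; the scalar version here only shows the arithmetic.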
The experiments are robust, utilizing two distinct datasets (TACTICAS and Bridge2AI-Voice) to validate the model's performance. The reported improvements in AUC scores across both tasks indicate a significant enhancement in diagnostic accuracy. The use of the J-ratio to measure speaker information leakage is a novel contribution that strengthens the findings. However, the paper could improve by providing more detailed statistical analyses and comparisons with baseline models to better contextualize the results.
While the paper outlines the methodology and datasets used, it lacks sufficient implementation details that would allow for full reproducibility. Key aspects such as the specific configurations of the model architecture, training procedures, and data preprocessing steps are not thoroughly documented. Providing access to code or supplementary materials would greatly enhance reproducibility.
The study is limited by its focus on Dutch speakers, which may affect the generalizability of the findings to other languages and dialects. Additionally, the model's performance on a wider range of respiratory conditions beyond asthma and COPD is not explored, which could limit its applicability in clinical settings. The reliance on specific acoustic features may also overlook other potentially relevant indicators of respiratory health.
This research has significant implications for the development of non-invasive, privacy-preserving monitoring systems for chronic respiratory diseases. By improving diagnostic accuracy while safeguarding patient identity, the framework could facilitate wider adoption of speech-based health monitoring technologies in clinical practice. The findings could also inspire further research into adversarial learning applications in healthcare, particularly in areas where patient privacy is a concern.
Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low frame rate continuous representation that is conducive for downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
Primary: Adobe Research
All Institutions: Adobe Research
The main contribution of this paper is the introduction of target-KL regularization for training continuous VAEs at fixed bitrates, enabling systematic comparisons with discrete audio codecs and enhancing the understanding of the compression-reconstruction trade-off in audio generation tasks. This work represents a meaningful advancement in the field of audio machine learning, with potential applications in various generative audio tasks.
The proposed method of target-KL regularization is a significant advancement in the training of continuous VAEs for audio generation. By systematically addressing the trade-off between compression and reconstruction quality, the authors provide a novel framework that allows for targeted bitrate control during training. This approach not only enhances the understanding of latent representations in VAEs but also facilitates direct comparisons with discrete audio codecs, which is a notable contribution to the field. The methodology is well-structured, with clear definitions and a solid theoretical foundation linking compression theory to VAE training.
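To make the link between a KL target and a bitrate concrete, the following sketch converts between the two and applies a simple penalty. It rests on assumptions not stated in the review: the per-frame KL (in nats) is treated as a rate estimate, and the penalty is the absolute deviation from the target. All function names are hypothetical.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)) in nats, summed over latent dimensions."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, logvar))


def nats_per_frame_to_kbps(kl_nats, frame_rate_hz):
    """Convert nats per latent frame into kilobits per second."""
    return kl_nats / math.log(2) * frame_rate_hz / 1000.0


def kbps_to_nats_per_frame(kbps, frame_rate_hz):
    """Inverse conversion: the KL target that corresponds to a chosen bitrate."""
    return kbps * 1000.0 / frame_rate_hz * math.log(2)


def target_kl_penalty(kl_nats, target_nats):
    """Assumed regularizer: penalize deviation from the target in either direction."""
    return abs(kl_nats - target_nats)


# Example: KL target per frame for 8 kbps at an assumed 50 Hz latent frame rate.
target = kbps_to_nats_per_frame(8.0, 50.0)
```

Sweeping the bitrate passed to `kbps_to_nats_per_frame` would yield one model per point on a rate-distortion curve, which is the kind of comparison the review describes.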
The experiments conducted are thorough and well-documented, utilizing a variety of datasets and architectures to evaluate the performance of the proposed DAC-VAE models. The results demonstrate a clear advantage of the target-KL regularization in achieving optimal compression rates for different audio tasks, including text-to-sound and text-to-speech generation. The use of rate-distortion curves to visualize the performance of various models is particularly effective in illustrating the benefits of the proposed method. However, the paper could benefit from more extensive qualitative evaluations and comparisons with a wider range of existing models.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which aids reproducibility. However, the absence of a publicly available code repository or demo URL limits others' ability to replicate the results directly. Including such resources would significantly enhance the reproducibility of the findings.
One limitation of the study is the reliance on proprietary datasets, which may restrict the generalizability of the results. Additionally, while the authors discuss the trade-offs involved in compression rates, there is limited exploration of how these findings might apply to other audio generation tasks beyond those tested. The qualitative aspects of generated audio, such as naturalness and emotional expressiveness, could also be further investigated.
The implications of this research are significant for the audio generation community, particularly in applications involving text-to-audio synthesis and music generation. By providing a framework for systematically studying the trade-offs in audio compression, this work could lead to advancements in the development of more efficient and higher-quality generative audio models. The findings may also influence future research directions in multimodal audio applications and the integration of audio generation with other machine learning tasks.
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.
Primary: Cornell University
All Institutions: Cornell University, TogetherAI, Stanford University
This paper presents a critical examination of voice cloning technologies, revealing that they often apply style transformations rather than faithfully reproducing individual voices. The findings underscore the need for greater awareness and regulation of voice cloning technologies to mitigate potential risks to personal identity and societal norms.
The methodology employed in this study is robust, utilizing a diverse participant pool and a systematic approach to evaluate voice cloning systems. The authors effectively use paired audio samples for human annotation, which allows for a direct comparison between source and cloned voices. The use of multiple TTS models and the inclusion of ablation studies to explore the effects of clip duration and generation settings provide a comprehensive understanding of the phenomena observed. However, the reliance on subjective human ratings introduces potential biases that could affect the results.
The experiments are well-structured, with a clear focus on evaluating the perceived qualities of cloned voices compared to their sources. The statistical significance of the findings is appropriately reported, and the use of various metrics to assess human perception adds depth to the analysis. The findings regarding the homogenization of speaker characteristics and the behavioral implications of voice cloning are particularly noteworthy. However, the paper could benefit from more extensive quantitative analysis alongside the qualitative assessments.
The authors provide sufficient detail regarding their experimental setup, including participant demographics, data collection methods, and the models used for voice cloning. The availability of datasets and code on GitHub enhances reproducibility. However, the paper lacks explicit details on the training processes and hyperparameters used for the TTS models, which could hinder full replication of the results.
The study acknowledges several limitations, including the potential biases in human ratings and the lack of demographic diversity in the participant pool. Additionally, the focus on a limited number of TTS models may not fully capture the variability across different voice cloning technologies. The implications of voice cloning on identity and cultural representation are significant but require further exploration in future work.
This research has substantial implications for the development and deployment of voice cloning technologies. The findings raise important ethical questions regarding identity preservation, trust in synthetic voices, and the potential for misuse in sensitive contexts. The study highlights the need for transparency in how voice cloning systems operate and the societal impacts they may have, particularly in terms of cultural homogenization and the reinforcement of existing biases in voice perception.
Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing. We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency. Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.
Primary: Stony Brook University
All Institutions: Stony Brook University, Northeastern University, University of Illinois Chicago, Discovery Partners Institute
The main contribution of this paper is its comprehensive survey of audio super-resolution and bandwidth extension techniques, providing a structured taxonomy and critical evaluation of existing methodologies. This work serves as a valuable resource for researchers seeking to understand the evolution of the field and the current state of generative modeling approaches.
The paper presents a comprehensive survey of audio super-resolution (SR) and bandwidth extension (BWE), effectively categorizing existing methodologies into discriminative and generative models. It critically evaluates the limitations of traditional deterministic approaches and highlights the advantages of generative frameworks, such as GANs and diffusion models. The authors provide a structured taxonomy that clarifies the relationship between BWE and SR, which is a significant contribution to the field. However, while the survey is thorough, it lacks original experimental results or novel methodologies that could further enhance its impact.
The paper does not present original experiments or results but instead synthesizes existing literature and methodologies. It reviews various datasets and evaluation metrics commonly used in the field, including subjective and objective measures. The lack of new experimental validation limits the paper's technical impact, as it primarily serves as a literature review rather than presenting novel findings.
As a survey paper, reproducibility is not directly applicable; however, the authors do provide a clear overview of existing methodologies and their evaluation metrics. The absence of new experimental results means there are no implementation details to reproduce, which is a common limitation in survey papers.
The primary limitation of this paper is its lack of original experimental contributions or novel methodologies. While it provides a comprehensive overview, it does not advance the field with new insights or findings. Additionally, the survey may not cover the most recent developments if they emerged after the paper's submission.
The survey has significant implications for researchers in audio processing, as it provides a structured overview of the evolution of BWE and SR techniques. By highlighting the shift towards generative models, it may guide future research directions and inspire the development of new methodologies. The discussion of emerging trends, such as the integration of large language models, indicates potential avenues for future exploration in multimodal audio systems.
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, without revealing which musical aspects are dominant in that influence. We propose ARIA, a framework that decomposes attribution along musical aspects (five for symbolic music, three for audio) and pairs the decomposition with reliability diagnostics computed from the segment-level score matrix. It measures within-group similarity among the top-K attributed tracks against random reference groups drawn from the training pool, and diagnoses the score matrix through its singular value decomposition and column statistics. On a symbolic-music model where attribution ground truth is available through counterfactual retraining, the reliability diagnostics rank four attribution methods identically to that ground truth. On an audio music generation model, ARIA reveals attribution behaviors that vary substantially across TDA methods, flags score matrices whose retrieved tracks are nearly identical across queries rather than reflecting per-query attribution, and characterizes embedding-similarity retrieval baselines by the musical aspect each encoder surfaces. Together, ARIA produces per-aspect attribution evidence aligned with the musical aspects considered under the idea-expression distinction in copyright analysis.
Primary: Chalmers University of Technology
All Institutions: Chalmers University of Technology, University of Gothenburg
The paper presents ARIA, a novel framework for music training data attribution that effectively decomposes influence along musical aspects and provides reliability diagnostics, addressing a critical need in the intersection of machine learning and copyright law.
The proposed ARIA framework innovatively decomposes training data attribution (TDA) along multiple musical aspects, addressing a significant gap in existing methods that reduce influence to a single scalar. The methodology includes reliability diagnostics based on segment-level score matrices and singular value decomposition, which are crucial for understanding the attribution behavior of different methods. This multi-faceted approach is particularly relevant in the context of music generation and copyright analysis, as it aligns with the legal framework of idea-expression distinction.
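The two kinds of diagnostics described above can be sketched as follows. This is an illustrative reading, not ARIA's actual procedure: the feature vectors, the use of cosine similarity, and the "top singular share" statistic are assumptions, and every function name is hypothetical. Feature rows are assumed to be non-zero.

```python
import numpy as np


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows in x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    n = len(x)
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


def topk_vs_random(features, scores, k, n_ref=100, seed=0):
    """Within-group similarity of the top-k attributed tracks vs. random groups."""
    rng = np.random.default_rng(seed)
    top = np.argsort(scores)[::-1][:k]
    top_sim = mean_pairwise_cosine(features[top])
    ref = [
        mean_pairwise_cosine(features[rng.choice(len(features), k, replace=False)])
        for _ in range(n_ref)
    ]
    return top_sim, float(np.mean(ref))


def top_singular_share(score_matrix):
    """Fraction of squared singular-value mass carried by the first component.

    A value near 1 suggests the matrix is close to rank one, i.e. the same
    tracks are retrieved for every query regardless of its content.
    """
    s = np.linalg.svd(score_matrix, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))
```

A top-k similarity well above the random reference indicates the attributed tracks share the aspect being measured, while a top singular share close to 1 flags the degenerate case the abstract mentions, where retrieved tracks barely change across queries.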
The experiments conducted on both symbolic and audio music generation models are well-structured, utilizing a benchmark with ground truth for validation and exploring the performance of various attribution methods. The results demonstrate the effectiveness of ARIA in revealing the influence of training songs on generated outputs and highlight the variability of attribution behaviors across different methods. The use of statistical measures to assess within-group similarity adds robustness to the findings.
The paper provides comprehensive details on the experimental setup, including model architectures, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
One limitation is the reliance on existing benchmarks and the challenges associated with creating ground truth for audio attribution, which may affect the generalizability of the findings. Additionally, the framework's performance may vary with different types of music or genres, which is not fully explored in the experiments.
The implications of this research extend to the legal domain, particularly in copyright analysis, as it provides a framework for understanding the influence of training data on generative models. This could aid in developing fair compensation mechanisms for artists and inform future regulations regarding AI-generated content. The framework also sets a foundation for further research in music generation and attribution, potentially influencing how generative models are evaluated and utilized in practice.
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification headers: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
Primary: Zhejiang University
All Institutions: Zhejiang University, Zhejiang Provincial Natural Science Foundation, National Natural Science Foundation of China
The main contribution of this work is the introduction of ToxiAlert-Bench, a comprehensive dataset for paralinguistic-aware toxic speech detection, and a dual-head neural network that significantly improves detection performance by integrating both textual and paralinguistic features. This paper represents a meaningful advancement in the field of audio-based machine learning, addressing a critical gap in existing research and providing a robust framework for future studies.
The paper introduces a novel dual-head neural network architecture designed specifically for detecting toxic speech by leveraging both textual and paralinguistic cues. The methodology is well-structured, involving a multi-stage training strategy that effectively reduces task interference and addresses data imbalance through class-balanced sampling and weighted loss functions. The dataset, ToxiAlert-Bench, is comprehensive, comprising over 30,000 audio clips with detailed annotations that allow for nuanced analysis of toxicity sources. The use of both real and synthesized audio samples enhances the dataset's robustness and diversity.
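Class-balanced sampling and weighted losses are commonly built from inverse class frequencies. The sketch below shows one such scheme; the paper's exact weighting is not given in this review, so the formula and the function names are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weight n / (num_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}


def sampling_probabilities(labels):
    """Per-example sampling probability that gives every class equal total mass."""
    w = inverse_frequency_weights(labels)
    total = sum(w[y] for y in labels)
    return [w[y] / total for y in labels]


# Toy imbalance: eight clips of one category, two of another (hypothetical labels).
labels = ["insult"] * 8 + ["threat"] * 2
weights = inverse_frequency_weights(labels)
probs = sampling_probabilities(labels)
```

The same weights can either scale each example's loss term or drive a weighted sampler; using both at once over-corrects, so in practice one would typically pick a single mechanism per training stage.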
The experiments are thorough, comparing the proposed method against several state-of-the-art baselines. The results demonstrate significant improvements in detection performance, particularly in identifying toxicity conveyed through paralinguistic cues. The paper provides detailed metrics, including accuracy and Macro-F1 scores, which support the claims of the model's effectiveness. The ablation studies further validate the contributions of the model's components, reinforcing the robustness of the findings.
The authors have taken steps to ensure reproducibility by documenting the dataset construction process and providing a GitHub repository for the model. However, the paper could benefit from more detailed implementation specifics, such as hyperparameter settings and training protocols, to facilitate easier replication by other researchers.
One limitation is the reliance on the quality of the synthetic data generated, which may not fully capture the complexity of real-world toxic speech. Additionally, while the dataset is extensive, the focus on English may limit the applicability of the findings to other languages and cultural contexts. The paper does not address potential biases in the dataset or the model's performance across different demographics.
This research has significant implications for online communication platforms, particularly in enhancing moderation systems for audio content. By addressing the nuances of toxic speech that are often overlooked in text-based moderation, the findings could lead to more effective tools for preventing harassment and promoting safer online environments. The dataset and model could serve as foundational resources for future research in audio-based toxicity detection.
In recent years, the performance of automatic speech recognition (ASR) systems has made considerable progress. Unfortunately, for people with speech impairments, such as people treated for oral cancer (OC), ASR performance is still lagging behind. The scarcity and variability of OC speech data make development of ASR models for this type of speech difficult. In this work, we use data augmentation and large language model (LLM) error correction to mitigate this problem. We apply various augmentation techniques on a corpus of Dutch oral cancer speech to create synthetic data, and evaluate their effect on ASR performance. We finetune Whisper and Massively Multilingual Speech (MMS) models for each augmentation technique and observe, on average, an 8% relative decrease in Word Error Rate (WER) when including data created using text-to-speech (TTS). When employing LLMs for error correction, we see a further 21.4-26.2% relative decrease in WER for finetuned ASR models and a 10.0% relative decrease for non-finetuned models. Overall, we achieve a 40% relative WER decrease for Whisper and a 50% relative WER decrease for MMS, indicating that a combination of data augmentation and LLM correction is a viable strategy for the recognition of OC speech.
Primary: University of Groningen
All Institutions: University of Groningen, University Hospital Cologne, University Medical Center Groningen, Netherlands Cancer Institute, University of Amsterdam, Nagoya University
The main contribution of this paper lies in its innovative approach to enhancing automatic speech recognition for oral cancer patients through data augmentation and large language model error correction. The results highlight the potential for improved communication technologies for individuals with speech impairments.
The authors propose a comprehensive methodology that combines various data augmentation techniques (time stretching, speed perturbation, vocal tract length perturbation, voice conversion, and text-to-speech) to enhance ASR performance for oral cancer speech. The integration of LLMs for error correction is particularly innovative, providing a dual approach to improving recognition accuracy. The methodology is well-structured, with clear descriptions of each augmentation technique and its rationale, alongside a robust experimental design that includes both finetuning and error correction.
The experiments are thorough, utilizing a leave-one-speaker-out approach to evaluate the ASR models on a dataset specifically tailored for oral cancer speech. The performance metrics, primarily Word Error Rate (WER), are well-documented, showing significant improvements across various models and augmentation methods. The results demonstrate the effectiveness of TTS augmentation and LLM error correction, with detailed comparisons across different configurations, which adds to the reliability of the findings.
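Since the reported gains are relative WER decreases, a short sketch of how WER and a relative decrease are computed may help in reading the numbers. The example sentences and the WER values passed to `relative_decrease` are made up for illustration and are not taken from the paper; the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_decrease(baseline, improved):
    """Relative reduction, e.g. 0.4 means the error rate fell by 40%."""
    return (baseline - improved) / baseline


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = relative_decrease(0.50, 0.30)
```

Note that a relative decrease depends on the baseline: the same absolute improvement yields a larger relative figure when the starting WER is lower.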
The paper provides sufficient detail regarding the implementation of the augmentation methods and the ASR models used, including hyperparameters and experimental setups. However, the lack of publicly available code or datasets limits the reproducibility of the results. Future work could benefit from sharing these resources to enhance transparency and facilitate further research in this area.
The study acknowledges several limitations, including the small dataset size and the ecological validity of the speech data, which may not generalize well to real-world scenarios. The reliance on a single augmentation method per experiment and the computational cost of the LLMs for error correction are also noted as potential drawbacks.
This research has significant implications for improving ASR systems for individuals with speech impairments, particularly those affected by oral cancer. By addressing a critical gap in ASR performance for pathological speech, the findings could lead to more accessible communication technologies for affected individuals, enhancing their quality of life. The combination of data augmentation and LLMs may also inspire future research in other domains of speech recognition.
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
Primary: Central Conservatory of Music
All Institutions: Central Conservatory of Music, Zhipu AI
The main contribution of this paper is the introduction of BandTok, a novel 2D Mel-spectrogram tokenizer that enhances autoregressive music generation through improved token independence and reconstruction fidelity. This work significantly advances the field by addressing limitations of existing tokenization methods and providing a robust framework for future research in audio generation.
The paper presents BandTok, a novel 2D Mel-spectrogram tokenizer specifically designed for autoregressive music generation. The methodology is well-structured, focusing on improving token independence and reducing error propagation through a shared codebook of Mel-frequency band tokens. The use of a multi-scale PatchGAN discriminator and EMA codebook updates enhances reconstruction fidelity, while the introduction of 2D Rotary Position Embedding (RoPE) effectively preserves the temporal and frequency-band structure during generation. The approach is innovative, leveraging a unique tokenization strategy that contrasts with traditional residual multi-codebook methods.
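As a rough illustration of the 2D RoPE idea mentioned above, the sketch below rotates the first half of a feature vector by the time index and the second half by the frequency-band index. The function names, the half/half split, and the base of 10000 are assumptions made for illustration, not details taken from the paper.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that depend on `pos`."""
    assert len(vec) % 2 == 0
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, t, f, base=10000.0):
    """Assumed layout: first half encodes time index t, second half band index f."""
    assert len(vec) % 4 == 0
    half = len(vec) // 2
    return rope_1d(vec[:half], t, base) + rope_1d(vec[half:], f, base)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated, the dot product between a rotated query and a rotated key depends only on the offsets in time and band, which is the property that lets attention see relative position on the token grid.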
The experiments are comprehensive, comparing BandTok against existing tokenizers and evaluating both reconstruction quality and generation performance. The use of objective metrics like FAD and CLAP scores, alongside subjective assessments, provides a robust evaluation framework. The results indicate that BandTok outperforms residual-codebook tokenizers, demonstrating its effectiveness in a data-limited setting. However, the paper could benefit from more extensive ablation studies to isolate the impact of each component of the proposed method.
The paper provides implementation details, including training configurations and evaluation metrics, which should facilitate reproducibility. The source code and generation demos are publicly available, further supporting the reproducibility of the results. However, the datasets used for training and evaluation are not clearly described, which could pose challenges for researchers attempting to replicate the study.
One limitation is the reliance on specific datasets, which may affect the generalizability of the results. The paper also does not address potential biases in the training data, which could influence the quality of generated music. Additionally, while the proposed method shows improvements over existing approaches, the paper does not explore the scalability of BandTok with larger datasets or more complex music generation tasks.
The proposed method has significant implications for the field of music generation, particularly in enhancing the quality and fidelity of generated audio. By improving tokenization strategies, BandTok could facilitate advancements in various applications, including music composition, sound design, and interactive audio systems. The integration of multimodal aspects, such as text conditioning, opens avenues for more sophisticated music generation frameworks that could benefit artists and content creators.
Generative models can address difficult problems with non-unique solutions, such as bandwidth extension, gap filling, and removing highly non-linear artifacts from codecs, clipping, and distortion, as opposed to removing linear additive components like noise and reverb. While large offline processing models have shown impressive results, these tasks have not been solved with real-time capable models with low latency and compute. We propose a few-step flow matching model using Data Prediction Mean Flows in combination with a suitable novel low-latency architecture to make flow matching models an attractive choice under these constraints. Compared to the state of the art, our proposed mean flow model uses 120x less compute and introduces no algorithmic latency other than the STFT, while achieving similar audio quality.
Primary: Microsoft Research
All Institutions: Microsoft Research
This work presents a significant advancement in real-time speech restoration using generative models, demonstrating a 120x reduction in computational complexity while maintaining audio quality. The combination of innovative methodologies and thorough experimental validation positions this research as a notable contribution to the field of machine learning and audio processing.
The paper introduces a novel few-step flow matching model utilizing Data Prediction Mean Flows (DP-MF) for real-time speech restoration. The methodology is well-structured, addressing the limitations of existing generative models in terms of latency and computational efficiency. The combination of innovative training techniques, such as the introduction of a data prediction loss and the careful design of flow time distributions, demonstrates a significant advancement in the field. The architecture is designed to minimize latency while maximizing audio quality, which is critical for real-time applications.
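To make the data-prediction idea concrete, the following is a minimal sketch of few-step sampling for ordinary flow matching with a data predictor. It is not the paper's DP-MF objective, which learns an average velocity over a time interval; it only assumes the linear path z_t = (1 - t) * x + t * eps, and `predict_x` is a hypothetical stand-in for a trained network.

```python
def sample(predict_x, z1, steps):
    """Euler-integrate from t=1 (noise) to t=0 (data) using a data predictor.

    Assumes the linear path z_t = (1 - t) * x + t * eps, for which the
    velocity implied by a data estimate x_hat is (z_t - x_hat) / t.
    """
    z = list(z1)
    for i in range(steps):
        t = 1.0 - i / steps        # current flow time
        s = 1.0 - (i + 1) / steps  # next flow time, closer to the data
        x_hat = predict_x(z, t)    # hypothetical model call
        z = [zi - (t - s) * (zi - xi) / t for zi, xi in zip(z, x_hat)]
    return z
```

With a perfect predictor the final step lands exactly on the data estimate regardless of the step count, which gives some intuition for why predicting data directly is attractive when only a few steps are affordable.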
The experiments are comprehensive, utilizing a large-scale dataset that simulates real-world audio degradation scenarios. The evaluation metrics include both subjective (MOS) and objective (WER, DNSMOS SIG) measures, which provide a balanced view of the model's performance. The results indicate that the proposed model achieves audio quality comparable to existing state-of-the-art models while significantly reducing computational requirements, showcasing the effectiveness of the proposed approach.
The paper provides sufficient details regarding the architecture, training data, and evaluation metrics, which would allow for reproducibility. However, the absence of a public code repository limits accessibility for other researchers wishing to replicate or build upon this work.
While the proposed model shows substantial improvements in latency and computational efficiency, there are still gaps in performance compared to non-causal models, particularly in terms of WER. Additionally, the reliance on specific training data and augmentation techniques may limit generalizability to other types of audio restoration tasks.
The advancements made in this paper have significant implications for various applications, including telecommunications, hearing aids, and augmented reality devices. By enabling real-time speech restoration with reduced computational demands, this work could enhance user experiences in environments where audio quality is critical.
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, NVIDIA
The paper presents a comprehensive study on the impact of synthetic conversational data on multi-talker ASR and speaker diarization, revealing critical insights into simulation strategies and their task-dependent effects. The introduction of FastMSS as an open-source toolkit represents a significant advancement in the field, enabling further research and application in multi-talker speech processing.
The paper introduces FastMSS, an open-source simulator that allows for the generation of synthetic multi-talker conversations with configurable parameters. The methodology is robust, systematically varying key factors such as turn-taking dynamics and source domain diversity. The authors provide a clear rationale for their choices and demonstrate the importance of task-specific simulation strategies, which is a significant contribution to the field. The use of two leading models, DiCoW for MT-ASR and Sortformer for SD, adds depth to the analysis, allowing for a comprehensive understanding of how synthetic data can be optimized for different tasks.
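The turn-taking factor can be pictured with a small timeline simulator. The sketch below is not the FastMSS API; the function names and parameters (`overlap_prob`, `max_overlap`, `max_pause`) are hypothetical and only show how an overlap setting translates into segment start times.

```python
import random

def simulate_turns(durations, speakers, overlap_prob, max_overlap, max_pause, seed=0):
    """Return (speaker, start, end) segments placed on a shared timeline."""
    rng = random.Random(seed)
    segments, prev_end = [], 0.0
    for dur, spk in zip(durations, speakers):
        if segments and rng.random() < overlap_prob:
            # Negative gap: this turn starts before the previous speech ends.
            gap = -rng.uniform(0.0, min(max_overlap, dur))
        else:
            gap = rng.uniform(0.0, max_pause)
        start = max(0.0, prev_end + gap)
        segments.append((spk, start, start + dur))
        prev_end = max(prev_end, start + dur)
    return segments

def overlap_ratio(segments, step=0.01):
    """Fraction of speech time (sampled on a grid) with two or more active segments."""
    end = max(e for _, _, e in segments)
    speech = overlapped = 0
    for i in range(int(end / step) + 1):
        t = i * step
        active = sum(1 for _, s, e in segments if s <= t < e)
        speech += active >= 1
        overlapped += active >= 2
    return overlapped / speech if speech else 0.0
```

Sweeping `overlap_prob` and measuring `overlap_ratio` is one simple way to produce training mixtures with different amounts of overlapped speech, the factor the study reports as helping ASR but hurting diarization.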
The experiments are well-designed, utilizing a variety of datasets that reflect real-world conditions. The results are clearly presented, showing the impact of different simulation strategies on performance metrics such as tcpWER for ASR and DER for diarization. The findings that synthetic data can approach real-data performance and that combining both yields the best results are particularly noteworthy. The paper effectively demonstrates the practical implications of its findings, making it relevant for both academic and industry applications.
The authors emphasize reproducibility by releasing FastMSS as an open-source toolkit, which is a commendable practice in the research community. They provide detailed descriptions of their experimental setup, including datasets and evaluation metrics, which further enhances the reproducibility of their results. However, the reliance on specific configurations and hyperparameters may require careful attention from users to replicate the results exactly.
One limitation noted in the paper is the potential lack of inter-turn semantic coherence in the generated conversations, which could affect the performance of ASR systems. Additionally, while the study covers a range of simulation strategies, the generalizability of the findings to other tasks or domains outside those tested remains uncertain. The paper could also benefit from a more extensive discussion on the ethical implications of using synthetic data in real-world applications.
The research has significant implications for the fields of speech recognition and speaker diarization, particularly in scenarios where real conversational data is scarce. By demonstrating that synthetic data can effectively complement or even substitute real data, this work opens avenues for more efficient training of ASR and diarization systems. The findings could lead to advancements in applications such as virtual assistants, automated meeting transcriptions, and other multi-talker environments.
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.
Primary: Princeton University
All Institutions: Princeton University, Technische Universität München, Rotterdam Eye Hospital, Centre for Tactile Internet with Human-in-the-Loop, Technische Universität Dresden, Munich Center for Machine Learning, Chair for Social Affective Touch
This paper presents a novel real-time sonification framework for enhancing surgical guidance during subretinal injections, demonstrating significant improvements in event identification accuracy through innovative auditory feedback mechanisms. The methodology and experimental results indicate a strong potential for clinical impact, although further validation in diverse surgical contexts is necessary for widespread adoption.
The proposed methodology introduces a structured sonification framework that effectively maps iOCT-derived anatomical features into auditory feedback, leveraging a physics-inspired acoustic model. The approach is well-defined, utilizing real-time updates based on segmented retinal layers and employing a mass-spring-damper system to reflect dynamic interactions during subretinal injections. The integration of both tool-driven and anatomy-driven excitations is innovative, enhancing the auditory feedback's relevance to surgical contexts. However, the reliance on a specific anatomical model may limit generalizability across different surgical scenarios.
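A single mass-spring-damper excited by an impulse gives a feel for how such a physics-inspired model turns an event into sound. This is a minimal sketch under assumed parameters, not the paper's model: the function name, the unit mass, and the mapping from stiffness to pitch and from damping to decay are illustrative choices.

```python
import math

def strike(freq_hz, decay_s, impulse=1.0, sr=16000, dur_s=0.5):
    """Simulate m*x'' + c*x' + k*x = 0 after an initial velocity impulse."""
    m = 1.0
    k = m * (2 * math.pi * freq_hz) ** 2  # stiffness sets the pitch
    c = 2 * m / decay_s                   # damping sets the decay time constant
    dt = 1.0 / sr
    x, v = 0.0, impulse / m               # impulse enters as initial velocity
    out = []
    for _ in range(int(sr * dur_s)):
        a = -(k * x + c * v) / m
        v += a * dt  # semi-implicit Euler: update velocity first
        x += v * dt
        out.append(x)  # displacement is used as the audio sample
    return out
```

In a sonification setting, one could imagine assigning a different `freq_hz` to each segmented layer and triggering `strike` when the tool or a deformation event reaches it; that mapping is hypothetical here.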
The user study involving 34 participants provides robust evidence of the proposed method's effectiveness, demonstrating significant improvements in event identification accuracy compared to a baseline. The statistical significance of the results (p < 0.001) strengthens the claims of enhanced performance. The qualitative evaluations and feedback from expert surgeons further validate the clinical applicability of the framework. However, additional details on participant demographics and the specific experimental setup would enhance the evaluation's transparency.
The paper provides a GitHub repository link for the code, which is a positive step towards reproducibility. However, the implementation details could be more thoroughly documented to facilitate easier replication by other researchers. The reliance on specific software libraries (e.g., miPhysics) should also be clearly stated to avoid potential compatibility issues.
The study's limitations include a small sample size for expert feedback and the potential for bias in participant selection. The framework's performance in diverse surgical scenarios beyond subretinal injection remains untested. Additionally, the auditory feedback's effectiveness may vary based on individual surgeon preferences and experiences, which could affect its adoption in clinical practice.
The proposed sonification framework has the potential to significantly enhance surgical precision and reduce cognitive load during delicate procedures like subretinal injections. By providing real-time auditory feedback, it could improve patient outcomes and streamline surgical workflows. The approach may also inspire further research into auditory feedback systems in other medical domains, potentially leading to broader applications in minimally invasive surgeries.
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
Primary: University of Oxford
All Institutions: University of Oxford, Australian Institute for Machine Learning, Stanford University, University of Central Florida, University of Surrey
The paper presents AuralSAM2, a novel framework that enhances the Segment Anything Model 2 by integrating audio features for improved promptable segmentation. This work significantly advances the field of audio-visual integration in machine learning, providing a robust methodology and strong experimental results that demonstrate its potential impact on future research and applications.
The methodology introduces AuralFuser, which effectively integrates audio features into the SAM2 framework without modifying its visual backbone. This is achieved through a novel approach that generates both sparse and dense prompts, enhancing the model's ability to leverage audio cues in segmentation tasks. The introduction of an audio-guided contrastive loss (AudioCon) is particularly innovative as it addresses the challenge of visual dominance in the latent space, ensuring that audio signals are prioritized in the learning process. The hierarchical design of the feature pyramid is a significant methodological advancement that preserves audio influence throughout the network.
The experimental evaluation is robust, utilizing two public benchmarks (Ref-AVS and AVSBench) to demonstrate the efficacy of AuralSAM2. The results show significant improvements in segmentation accuracy compared to existing methods, particularly in human-in-the-loop scenarios, which is a critical application area. The ablation studies effectively highlight the contributions of different components of the proposed method, reinforcing the validity of the results.
The paper provides a link to the code repository, which is essential for reproducibility. However, the implementation details could be more comprehensive, particularly regarding the training setup and hyperparameters used. Clearer documentation would enhance the ability of other researchers to replicate the results.
One limitation is the reliance on the SAM2 framework, which may restrict the generalizability of the proposed method to other architectures. Additionally, while the integration of audio is innovative, the paper does not extensively discuss the potential challenges in real-world applications, such as varying audio quality or background noise.
The integration of audio into visual segmentation tasks has significant implications for various applications, including video analysis, surveillance, and human-computer interaction. By improving the accuracy of segmentation in scenarios where audio cues are present, this work could enhance the usability of AI systems in real-world environments, making them more efficient and effective.
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.
Primary: University of New Brunswick
All Institutions: University of New Brunswick, FPT Software, University of Science
The paper presents a contestable multi-agent framework for multimedia verification that integrates multimodal large language models and an arena-based argumentation approach. The methodology is innovative and addresses critical issues in multimedia verification, although empirical validation and detailed experimental results are needed to fully assess its impact.
The proposed methodology is innovative, integrating multimodal large language models with an arena-based quantitative bipolar argumentation framework. The multi-agent approach effectively decomposes multimedia verification tasks into claim-centered sections, allowing for structured argumentation and transparent reasoning. The use of selective clash resolution and uncertainty-aware escalation enhances the system's robustness and practicality for real-world applications.
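How support and attack arguments with strength scores resolve into a verdict can be sketched with a small quantitative bipolar argumentation graph. The example uses DF-QuAD-style aggregation as a stand-in; the paper's A-QBAF semantics may differ, and the argument names and base scores are made up for illustration.

```python
def strength(arg, base, attackers, supporters, memo=None):
    """Final strength of `arg` in an acyclic graph (DF-QuAD-style aggregation)."""
    memo = {} if memo is None else memo
    if arg in memo:
        return memo[arg]

    def aggregate(children):
        # Probabilistic sum of child strengths; 0.0 when there are no children.
        prod = 1.0
        for ch in children:
            prod *= 1.0 - strength(ch, base, attackers, supporters, memo)
        return 1.0 - prod

    va = aggregate(attackers.get(arg, []))
    vs = aggregate(supporters.get(arg, []))
    tau = base[arg]
    if va >= vs:
        s = tau - tau * (va - vs)          # attacks dominate: move toward 0
    else:
        s = tau + (1.0 - tau) * (vs - va)  # supports dominate: move toward 1
    memo[arg] = s
    return s

# Hypothetical section: one claim, one supporting and one attacking evidence item.
base = {"claim": 0.5, "geo_match": 0.8, "date_conflict": 0.4}
supporters = {"claim": ["geo_match"]}
attackers = {"claim": ["date_conflict"]}
```

Here the claim ends at 0.7, since the support (0.8) outweighs the attack (0.4). Editing a base score or removing an edge and recomputing is what makes such a report contestable.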
The paper lacks detailed experimental results or benchmarks that validate the proposed framework's effectiveness. While it describes the methodology in depth, the absence of empirical data or comparisons against existing methods limits the assessment of its performance and impact.
The implementation is publicly available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient details on the datasets used, evaluation metrics, or specific experimental setups, which could hinder full reproducibility.
The paper does not address potential limitations in terms of scalability, the complexity of the argumentation process, or the handling of ambiguous cases. Additionally, the reliance on external verification tools may introduce variability in results based on the quality of those tools.
The framework has significant implications for multimedia verification, particularly in combating misinformation in digital media. Its emphasis on contestability and transparency could enhance trust in automated verification systems, making it a valuable tool for journalists, fact-checkers, and the general public.
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
Primary: Sharif University of Technology
All Institutions: Sharif University of Technology, Independent Researcher
This paper presents the first large-scale dataset of Persian music and successfully adapts a state-of-the-art generative model to this culturally rich domain. The comprehensive methodology and promising results underscore the potential for AI to engage with and celebrate diverse musical traditions.
The methodology is robust, featuring a comprehensive dataset curation process that addresses the significant gap in Persian music resources. The authors employed a sophisticated approach for audio segmentation, tagging, and conditioning using state-of-the-art models. The three-stage training pipeline for adapting MusicGen to Persian music is well-structured, emphasizing unsupervised domain adaptation, instrument-focused fine-tuning, and supervised fine-tuning, which collectively enhance the model's cultural fidelity and stylistic accuracy. However, the reliance on automated tagging and the absence of expert validation for some aspects of the dataset may introduce noise and inaccuracies.
The experimental evaluation is thorough, utilizing both objective metrics (KLD and Chroma Cosine Similarity) and a hybrid evaluation strategy. The results indicate that the fine-tuned model significantly outperforms the baseline in generating culturally coherent Persian music. However, the evaluation could benefit from a more extensive subjective assessment involving trained musicians to capture perceptual qualities that are critical in music generation.
The paper provides a clear description of the dataset creation process and model training, which facilitates reproducibility. However, some details regarding the specific configurations used during training and the exact nature of the evaluation metrics could be elaborated upon to enhance clarity for future researchers attempting to replicate the study.
Key limitations include the dataset's skewed genre distribution towards Persian pop, which may affect the model's generalizability across other Persian music styles. The automatic tagging process may introduce inaccuracies, and the evaluation metrics used do not fully capture the richness of Persian music, particularly in terms of microtonal fidelity and ornamentation. Additionally, the model's performance may be constrained by the smaller variant of MusicGen used for fine-tuning.
This research has significant implications for the field of generative music, particularly in promoting cultural diversity in AI-generated content. By addressing the underrepresentation of Persian music in generative models, this work opens avenues for further exploration of other non-Western musical traditions. The dataset created can serve as a valuable resource for future research in music generation, potentially influencing the development of more culturally-aware AI systems.
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features, ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.
Primary: Kyoto University
All Institutions: Kyoto University, LY Corporation
The main contribution of this paper is the introduction of the TE2SL framework, which enhances text-only domain adaptation in LLM-based ASR by generating expressive pseudo-audio prompts through a learnable refinement module. This work represents a significant advancement in bridging the modality gap in ASR systems, with promising implications for improving performance in data-scarce environments.
The proposed Text-Embedding-to-Speech-Latent (TE2SL) framework innovatively addresses the challenge of text-only domain adaptation in LLM-based ASR by introducing a learnable refinement module that enhances the quality of pseudo-audio prompts. This method effectively bridges the modality gap by ensuring that the synthesized prompts are both sample-dependent and aligned with the characteristics of the audio encoder and projector. The methodology is well-structured, with a clear distinction between training and adaptation phases, and utilizes a Conformer architecture to achieve this refinement. The focus on architecture-aware synthesis is a significant advancement over previous heuristic approaches.
The experiments conducted are thorough, comparing the TE2SL framework against established baselines, including LLM-only fine-tuning and pseudo-audio prompt methods. The results demonstrate substantial improvements in both recognition accuracy and out-of-vocabulary (OOV) recall across multiple datasets in English and Japanese, validating the effectiveness of the proposed method. The use of diverse datasets strengthens the generalizability of the findings, and the metrics employed (WER and CER) are appropriate for evaluating ASR performance.
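For readers less familiar with the metrics named above, WER and CER are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below is a plain reference implementation with hypothetical function names; it applies no text normalization, which real evaluations usually do.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

CER is the usual choice for languages such as Japanese, where word boundaries are not marked by spaces.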
The paper provides a detailed description of the experimental setup, including model architectures, training configurations, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Clearer documentation or a supplementary material section with implementation details could enhance reproducibility.
One limitation is the reliance on the quality of the audio encoder and projector, which may vary across different languages or domains. Additionally, while the method shows promise in improving OOV recall, the paper does not extensively discuss the implications of these improvements in practical applications. The scalability of the TE2SL framework in low-resource settings, where high-quality audio encoders may not be available, also warrants further exploration.
The proposed approach has significant potential applications in various domains where ASR systems are deployed, particularly in low-resource languages or specialized fields with limited paired data. By improving domain adaptation capabilities, this work can enhance accessibility and usability of ASR technologies in diverse linguistic contexts. The findings could also inform future research on multimodal learning and integration of audio-visual data in ASR systems.
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
Primary: Institute of Engineering, Tribhuvan University
All Institutions: Institute of Engineering, Tribhuvan University
IsoNet presents a novel approach to audio-visual target speech extraction, effectively addressing the limitations of compact microphone arrays in challenging acoustic environments. The combination of advanced methodologies and thorough experimental validation positions this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology in IsoNet is robust, combining multi-channel STFT features, GCC-PHAT spatial cues, and face-conditioned visual embeddings within a U-Net architecture. The use of curriculum learning to progressively introduce SNR challenges is a thoughtful approach that enhances model robustness. The architecture is designed to address specific failure modes of compact microphone arrays, making it relevant for practical applications. The integration of auxiliary direction-of-arrival supervision is a notable addition that helps regularize the learning process.
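As background on the spatial cue mentioned above, here is a minimal sketch of generic GCC-PHAT delay estimation between two microphone signals. It is not the IsoNet feature pipeline: the function names are hypothetical, the paper's delay-bin encoding is not reproduced, and a naive O(n²) DFT is used only to keep the example dependency-free (an FFT routine such as `numpy.fft.rfft` would be the practical choice).

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def gcc_phat(sig, ref, max_lag):
    """Return (lags, values): PHAT-weighted cross-correlation for lags in [-max_lag, max_lag]."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation does not wrap around
    a = list(sig) + [0.0] * (n - len(sig))
    b = list(ref) + [0.0] * (n - len(ref))
    cross = [x * y.conjugate() for x, y in zip(dft(a), dft(b))]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in idft(whitened)]
    lags = list(range(-max_lag, max_lag + 1))
    values = [cc[lag % n] for lag in lags]  # negative lags live at the end of the array
    return lags, values


def estimate_delay(sig, ref, max_lag):
    """Lag (in samples) at the correlation peak; positive means sig lags behind ref."""
    lags, values = gcc_phat(sig, ref, max_lag)
    return lags[max(range(len(values)), key=values.__getitem__)]
```

With an aperture of only a few centimetres, the physically possible delays span just a handful of samples at typical sampling rates, which is why a small `max_lag` window is sufficient and why such cues are hard for classical beamformers to exploit.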
The experiments are well-structured, utilizing a large dataset of 25,000 simulated mixtures from VoxCeleb, which is appropriate for the task. The evaluation metrics (SI-SDR, PESQ, and STOI) provide a comprehensive view of both objective and perceptual quality. The results demonstrate significant improvements over baseline methods, particularly in challenging SNR conditions. The ablation studies effectively isolate the contributions of different components of the model, providing clear insights into the efficacy of visual and spatial conditioning.
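The headline metric can be made concrete with a minimal sketch of SI-SDR and its improvement over the input mixture (SI-SDRi). This is the standard scale-invariant definition written in plain Python for illustration; the function names are hypothetical and the authors' evaluation code may differ in details such as mean removal or the stabilising constant `eps`.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]  # remove DC offset
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to find the optimally scaled target.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]
    signal_energy = sum(s * s for s in scaled_target)
    noise_energy = sum(r * r for r in residual)
    return 10 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

If the reported 9.31 dB SI-SDR and 4.85 dB improvement are averages over the same test set, they imply a mean input SI-SDR of about 9.31 - 4.85 = 4.46 dB for the mixtures.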
The paper provides sufficient detail on the experimental setup, including the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the lack of publicly available code or datasets limits independent verification of the results.
The study primarily focuses on scenarios with a single interfering speaker, which may not fully capture the complexities of real-world environments with multiple speakers and background noise. Additionally, the reliance on simulated data may introduce discrepancies when transitioning to real-world applications. The phase reconstruction method used could also be improved for better performance in low SNR conditions.
The proposed IsoNet system has significant implications for various applications, including voice assistants, hearing aids, and augmented reality devices, where selective listening is crucial. By enhancing the ability to extract target speech in complex acoustic environments, this research could improve user experiences in everyday communication scenarios.
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The main contribution of this paper is the introduction of "Break-the-Beat!", a novel model for controllable MIDI-to-drum audio synthesis that combines advanced conditioning mechanisms with a pre-trained audio generation framework. This work not only fills a crucial gap in the existing literature but also offers practical tools for music producers, enhancing the creative process in digital music production.
The methodology presented in the paper is robust and innovative, leveraging a pre-trained text-to-audio model (SAO) and introducing a dual-input content encoder that effectively combines MIDI and reference audio for drum synthesis. The hybrid conditioning mechanism is a noteworthy contribution, allowing for precise control over both rhythm and timbre. The use of a novel dataset constructed from existing drum audio datasets is a significant step towards addressing the lack of resources in this area. The authors provide a clear overview of their approach, detailing the input representations, conditioning mechanisms, and training strategies, which enhances the clarity and reproducibility of their work.
The experimental evaluation is thorough, utilizing a well-defined dataset and a variety of metrics to assess the performance of the proposed model. The results demonstrate significant improvements in audio quality, rhythmic alignment, and beat continuity, particularly when using higher temporal resolutions for MIDI input. The paper effectively compares its method against various baselines and provides qualitative and quantitative analyses, which strengthen the validity of the findings. However, the paper could benefit from additional user studies or subjective evaluations to further substantiate the claims of improved audio quality.
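To illustrate why temporal resolution matters for drum MIDI, the sketch below rasterizes drum hits onto a frame grid at a chosen frame rate. It is purely illustrative: the function name, the sparse grid layout, and the frame rates are assumptions, and the paper's actual input representation is not described in this review.

```python
def rasterize_drum_hits(hits, duration_s, frames_per_second):
    """Map (onset_seconds, midi_pitch, velocity) hits onto a sparse (pitch, frame) grid.

    Returns (grid, num_frames), where grid maps (pitch, frame) to the loudest
    velocity that landed in that cell. Hypothetical layout for illustration.
    """
    num_frames = int(round(duration_s * frames_per_second))
    grid = {}
    for onset, pitch, velocity in hits:
        frame = int(onset * frames_per_second)  # floor to the containing frame
        if 0 <= frame < num_frames:
            key = (pitch, frame)
            grid[key] = max(grid.get(key, 0), velocity)
    return grid, num_frames


# Two closed hi-hat hits (General MIDI pitch 42) played 35 ms apart.
hits = [(0.000, 42, 100), (0.035, 42, 80)]
coarse, _ = rasterize_drum_hits(hits, duration_s=1.0, frames_per_second=20)
fine, _ = rasterize_drum_hits(hits, duration_s=1.0, frames_per_second=100)
```

At 20 frames per second both hits fall into the same 50 ms frame and merge into one cell, whereas at 100 frames per second they occupy separate frames, so fast figures such as flams and rolls survive only on the finer grid.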
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository limits the ability of other researchers to fully replicate the study. Providing access to the trained models or code would significantly enhance the reproducibility of the results.
One limitation of the study is the reliance on a specific dataset, which may not encompass the full diversity of drum sounds and styles encountered in real-world music production. Additionally, while the model performs well on the evaluated metrics, the subjective quality of generated audio in practical scenarios remains to be fully explored. The paper also does not address potential computational costs associated with training and inference, which could be a barrier for some users.
The proposed model has the potential to significantly impact digital music production by providing a tool that allows for greater control and creativity in drum synthesis. This could democratize music production for non-experts and enhance the workflow of professional producers. Furthermore, the findings could inspire future research in the area of symbolic-to-audio synthesis, particularly for other instrument types and musical styles.