Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
Primary: Meta AI
All Institutions: Meta AI, Northeastern University
WavFlow presents a novel approach to audio generation by directly synthesizing waveforms without latent-space compression. This work significantly advances the field of multimodal audio generation, offering a simpler and more scalable alternative that achieves competitive performance on established benchmarks.
The methodology presented in WavFlow is innovative as it directly generates audio in the waveform space without relying on latent-space compression, which is a significant departure from traditional approaches. The authors introduce waveform patchification and amplitude lifting techniques to manage the high-dimensional nature of raw audio, which enhances the model's ability to learn complex acoustic patterns. The use of conditional flow matching and the x-prediction strategy for stable training are noteworthy advancements that contribute to the robustness of the model. The architecture is designed to facilitate multimodal learning, effectively integrating video and text inputs to enhance audio generation quality.
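As a rough illustration of the waveform patchification idea described above, the following sketch reshapes a 1D waveform into a grid of fixed-size tokens and back. The function names, the patch size, and the constant-gain `lift_amplitude` are assumptions made for illustration only; the digest does not specify how the paper implements amplitude lifting, so the gain here is a placeholder rather than the authors' method.

```python
import numpy as np

def patchify(waveform: np.ndarray, patch_size: int) -> np.ndarray:
    """Reshape a 1D waveform into a (num_patches, patch_size) token grid.

    The tail is zero-padded so the length is a multiple of patch_size.
    """
    pad = (-len(waveform)) % patch_size
    padded = np.pad(waveform, (0, pad))
    return padded.reshape(-1, patch_size)

def unpatchify(tokens: np.ndarray, length: int) -> np.ndarray:
    """Flatten the token grid and drop the padding added by patchify."""
    return tokens.reshape(-1)[:length]

def lift_amplitude(waveform: np.ndarray, gain: float) -> np.ndarray:
    """Hypothetical placeholder: rescale low-energy samples by a constant gain."""
    return waveform * gain

# Usage: a 10-sample signal becomes 3 tokens of 4 samples each.
signal = np.arange(10, dtype=np.float32)
tokens = patchify(signal, patch_size=4)
restored = unpatchify(tokens, length=len(signal))
```

The round trip is lossless apart from the padding, which is why the original length has to be carried alongside the tokens.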
The experimental evaluation is thorough, utilizing large-scale datasets and benchmarking against state-of-the-art models in both video-to-audio (VT2A) and text-to-audio (T2A) tasks. The results demonstrate that WavFlow achieves competitive performance, often surpassing existing latent-based methods in various metrics such as Fréchet Distance and Inception Score. The extensive ablation studies provide insights into the impact of different architectural choices and training configurations, reinforcing the validity of the proposed methods.
The paper provides detailed training configurations and experimental setups, which enhance reproducibility. However, the absence of a publicly available demo or project URL limits the ease with which other researchers can replicate the findings. The authors do mention using proprietary datasets, which may also pose challenges for full reproducibility in terms of data access.
The paper acknowledges limitations, particularly in the context of speech or singing synthesis, which are not explicitly addressed by the current model. The authors suggest that extending the framework to include these aspects would require larger datasets and more granular linguistic annotations. Additionally, the reliance on high-quality data curation may limit the model's applicability in scenarios with less controlled data environments.
The implications of this work are significant for the field of audio generation, particularly in applications requiring high-fidelity synthesis and multimodal integration. By demonstrating that high-quality audio can be generated without intermediate latent representations, this research opens avenues for more efficient audio generation frameworks. The potential applications span various domains, including film production, video game sound design, and interactive media, where real-time audio generation aligned with visual content is crucial.
Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test MEG from a large-scale audio library (LibriVox). Second, a speech detection model generates the binary silence/speech sequence directly from this retrieved audio. With this approach, our team Sherlock Holmes achieved first place in the extended track (F1-score: 0.962), demonstrating that leveraging external audio databases is a highly effective strategy.
Primary: Peking University
All Institutions: College of Future Technology, Academy for Advanced Interdisciplinary Studies, Center for BioMed-X Research, Institute of Molecular Medicine, National Biomedical Imaging Center, Peking-Tsinghua Center for Life Sciences, School of Intelligence Science and Technology, Speech and Hearing Research Center, State Key Laboratory of General Artificial Intelligence, State Key Laboratory of Membrane Biology
The paper presents a novel two-step framework for speech detection from MEG signals, achieving state-of-the-art results by leveraging large-scale audio retrieval. This work demonstrates a significant advancement in the field of non-invasive BCIs and opens new avenues for research in audio processing and brain signal interpretation.
The proposed two-step framework is innovative in its approach to bypass direct reconstruction of speech from MEG signals by leveraging a large-scale audio library for retrieval. The use of contrastive learning for matching MEG segments with audio segments is a novel application in this context, highlighting the potential of match-mismatch tasks over traditional regression methods. The methodology is well-structured, with clear steps outlined for both the retrieval and detection phases, although the paper could benefit from more detailed explanations of the model architectures and hyperparameter choices.
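The retrieval step can be pictured as a nearest-neighbour search in a shared embedding space. The sketch below assumes that a contrastively trained model has already produced one embedding for the MEG segment and one per candidate audio segment; the function names and the use of cosine similarity are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve(meg_embedding: np.ndarray, audio_embeddings: np.ndarray):
    """Return (index, cosine similarity) of the best-matching audio segment.

    meg_embedding: shape (D,); audio_embeddings: shape (N, D).
    """
    query = l2_normalize(meg_embedding)
    library = l2_normalize(audio_embeddings)
    scores = library @ query          # cosine similarity per candidate
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Usage with a toy three-segment library.
library = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
index, score = retrieve(np.array([0.1, 0.9]), library)
```

Once the matching segment is found, the second stage operates on clean audio rather than on the MEG signal, which is what lets the approach sidestep direct reconstruction.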
The experiments are robust, with a clear description of data preparation, model training, and testing procedures. The authors achieved an impressive F1-score of 0.962, which is a significant contribution to the field, particularly given the challenges associated with decoding speech from noisy brain signals. However, the paper lacks a comparative analysis with other existing methods, which would strengthen the claims of superiority.
While the paper provides a good overview of the methods and results, it lacks detailed implementation specifics such as code availability, which is crucial for reproducibility. The absence of a public repository or demo limits the ability of other researchers to replicate the results.
One limitation is the reliance on a specific audio library (LibriVox), which may not generalize well to other datasets or real-world applications. Additionally, the method's performance on diverse speech types or accents is not addressed, which could affect its applicability. The paper also does not discuss the computational resources required for the proposed approach, which may limit accessibility for some researchers.
This research has the potential to significantly advance non-invasive brain-computer interfaces (BCIs) and improve communication methods for individuals with speech impairments. The innovative use of audio retrieval could inspire further exploration in related fields, such as cognitive neuroscience and assistive technologies.
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST), University of Seoul
The main contribution of this paper is the introduction of SpeakerLLM, a speaker-specialized audio-LLM framework that effectively integrates speaker understanding and verification reasoning within a natural-language interface. This work significantly advances the field of audio processing by enhancing the explainability and accuracy of speaker verification systems, making it a valuable addition to the literature.
The paper presents a well-structured methodology with a clear two-stage training process for SpeakerLLM, which effectively integrates speaker profiling, recording condition understanding, and verification reasoning. The hierarchical speaker tokenizer is a novel approach that captures different granularities of speaker evidence, enhancing the model's ability to process and understand speaker-specific cues. The decision-composition policy that separates profile-level evidence from the final decision is a significant advancement in explainability for speaker verification systems.
The experiments are comprehensive, demonstrating the effectiveness of SpeakerLLM-Base and SpeakerLLM-VR through various tasks, including speaker profiling and verification reasoning. The results show substantial improvements over general audio-LLMs, especially in tasks requiring fine-grained acoustic evidence. The use of a controlled dataset and clear evaluation metrics strengthens the findings.
The authors commit to releasing the metadata-enriched supervision dataset and target-construction code, which is crucial for reproducibility. However, the paper could benefit from additional details on the implementation of the models and the specific configurations used during training.
The paper acknowledges limitations, including the need for further evaluation of the model in real-world noisy environments and the necessity of consent-aware interfaces for user privacy. The reliance on specific datasets may limit the generalizability of the findings.
The proposed framework has significant implications for the development of audio-first AI systems, particularly in enhancing user interaction through personalized and context-aware speaker verification. The ability to provide explainable decisions in speaker verification can improve trust and usability in applications like conversational agents and security systems.
Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the challenges of streaming ASR. For example, CTC-based word spotting (CTC-WS) has demonstrated strong performance by directly detecting keywords from CTC log-probabilities, but it is limited to offline processing and requires access to the full utterance. In this work, we present a streaming extension of CTC-WS for real-time contextual biasing. Our method maintains active keyword paths across audio chunks using a stateful token passing algorithm, enabling the detection of keywords that span multiple chunks. To ensure low latency and stable output, we introduce an incremental commitment mechanism that only emits segments guaranteed not to be affected by future audio, while deferring uncertain regions. This method naturally integrates with streaming ASR pipelines and does not require modifications to the underlying acoustic model or additional training, making it practical for real-world deployment. Experimental results show that our method reduces overall WER and effectively improves keyword F-score, demonstrating its effectiveness for real-time ASR applications.
Primary: National Taiwan Normal University
All Institutions: National Taiwan Normal University
This paper presents a novel approach to contextual biasing in streaming ASR, effectively addressing the challenges of recognizing rare and domain-specific words in real-time applications. The methodology is innovative, and the results indicate a meaningful contribution to the field of automatic speech recognition.
The proposed methodology extends CTC-based word spotting to a streaming ASR context, which is a significant advancement given the limitations of existing methods that primarily focus on offline processing. The introduction of a stateful token passing algorithm and an incremental commitment mechanism allows for the detection of keywords that may span across audio chunks, addressing a critical challenge in streaming ASR. The method's design ensures that it integrates seamlessly with existing ASR pipelines without requiring retraining or architectural changes, enhancing its practical applicability.
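One way to read the incremental commitment mechanism is as a boundary rule: frames that precede the start of every still-active keyword path can no longer be rewritten by a future keyword match, so they are safe to emit. The sketch below encodes only that rule. It is a simplification under this assumption, with hypothetical names, and is not the paper's algorithm.

```python
def committable_frames(frames_seen: int,
                       active_path_starts: list[int],
                       committed_upto: int) -> tuple[int, int]:
    """Return the half-open frame range [start, end) that is safe to emit now.

    frames_seen: number of frames processed so far.
    active_path_starts: start frame of each keyword path still alive.
    committed_upto: first frame that has not been emitted yet.
    """
    # Frames at or after the earliest active path start remain uncertain.
    boundary = min(active_path_starts, default=frames_seen)
    end = max(committed_upto, min(boundary, frames_seen))
    return committed_upto, end

# Usage: two keyword paths are alive, starting at frames 80 and 95.
emit_range = committable_frames(100, [80, 95], committed_upto=60)
```

With no active paths, everything seen so far can be emitted; with a path that started before the last commit point, nothing new is emitted until that path resolves.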
The experimental results demonstrate a clear improvement in both word error rate (WER) and keyword F-score across two datasets specifically designed for named entities. The comparisons with existing methods, such as GPU-accelerated phrase boosting, further validate the effectiveness of the proposed approach. The experiments are well-structured, utilizing appropriate datasets and metrics to assess the performance of the proposed method.
The paper provides sufficient details regarding the datasets used and the experimental setup, including the model architecture and evaluation metrics. However, the lack of publicly available code or a demo limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate further research and validation.
One limitation is the reliance on specific datasets (STOP1 and STOP2) that may not generalize across all ASR applications. Additionally, while the method shows improvements in performance, the computational overhead introduced by the word spotting mechanism may still pose challenges in highly resource-constrained environments.
The advancements presented in this paper have significant implications for real-time applications such as live captioning, voice assistants, and interactive systems where accurate recognition of domain-specific terms is crucial. By improving the recognition of rare and context-specific words, this work could enhance user experience and accessibility in various speech-driven technologies.
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior to distinguish original recordings from manipulated mixtures, which calibrates the final decisions. Next, two complementary five-class detectors, leveraging SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features integrated via a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group
The paper introduces EnvTriCascade, a tri-stage cascaded framework for audio deepfake detection, achieving a Macro-F1 score of 0.8266 and ranking second in the ESDD2 Challenge. The methodology demonstrates a sophisticated approach to addressing the complexities of mixed audio environments, combining innovative feature extraction and classification strategies that could significantly advance the field of audio deepfake detection.
The paper presents a novel tri-stage cascaded framework (EnvTriCascade) for audio deepfake detection that effectively addresses the challenges of component-level spoofing in mixed audio environments. The methodology is well-structured, incorporating a mix-consistency detector for binary classification, followed by dual-branch multi-class detectors that leverage self-supervised learning representations. The use of a cross-branch attention-gated classifier and RawBoost augmentation enhances the robustness of the system against diverse acoustic conditions. The approach of fusing multiple feature representations and employing a calibration mechanism to mitigate decision conflicts is innovative and demonstrates a solid understanding of the complexities involved in audio deepfake detection.
The experiments are conducted on the CompSpoofV2 dataset, which is substantial in size and complexity, providing a strong basis for evaluating the proposed methodology. The reported Macro-F1 score of 0.8266, which significantly outperforms the baseline, indicates the effectiveness of the proposed framework. The paper includes detailed comparisons of various system configurations, showcasing the contributions of each component to the overall performance. However, the absence of external validation datasets or comparisons with other state-of-the-art methods limits the breadth of the evaluation.
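For readers less familiar with the headline metric, Macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below is a minimal pure-Python version; the toy labels are invented for illustration, and the convention of scoring an absent class as 0.0 is an assumption rather than something stated by the challenge.

```python
def macro_f1(y_true: list[int], y_pred: list[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Usage on a toy three-class problem (labels are illustrative only).
toy_score = macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)
```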
The implementation details are well-documented, including the architecture, training process, and hyperparameters. The use of frozen pre-trained models and specific augmentation strategies is clearly described, which aids in reproducibility. However, the lack of publicly available code or a demo URL limits the ability for others to replicate the results independently.
One limitation of the study is the reliance on a single dataset for training and evaluation, which may affect the generalizability of the results. Additionally, while the proposed system achieves high performance, the complexity of the model with a large number of parameters could pose challenges in real-world applications regarding computational efficiency and deployment.
The proposed framework has significant implications for the field of audio deepfake detection, particularly in enhancing the reliability of audio authenticity verification in various applications such as media, security, and communications. The advancements in component-level spoofing detection could lead to improved systems for combating misinformation and ensuring the integrity of audio content.
Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades significantly when evaluated on mismatched array geometries. In this paper, we propose a geometry-conditioned SSF (GC-SSF), which incorporates a geometry-conditioning branch based on FiLM layers. Furthermore, we propose a feature that jointly encodes the DOA and the microphone positions (DOA-MPE). The conditioning branch modulates the intermediate feature maps of the SSF using the DOA-MPE feature to capture the spatial relationship between the microphone positions and the target speaker. Experimental results across circular, uniform linear, and random microphone arrays show that the proposed GC-SSF generalizes better to mismatched geometries while maintaining high spatial selectivity, demonstrating its ability to effectively adapt the filtering process to different array geometries.
Primary: Carl von Ossietzky Universität Oldenburg
All Institutions: Carl von Ossietzky Universität Oldenburg, German Research Foundation
The main contribution of this paper is the introduction of a geometry-conditioned spatially selective filter (GC-SSF) that enhances target speaker extraction across varying microphone geometries, significantly improving generalization and robustness. This work represents a meaningful step forward in the field of audio processing, addressing a critical challenge with innovative methods and thorough experimental validation.
The proposed methodology introduces a geometry-conditioned spatially selective non-linear filter (GC-SSF) that effectively incorporates a geometry-conditioning branch using FiLM layers and a novel DOA-Microphone Positional Encoding (DOA-MPE) feature. This approach addresses the limitations of existing spatially selective filters by allowing the model to generalize across different microphone geometries, which is a significant advancement in the field of target speaker extraction. The integration of positional encoding and conditioning mechanisms is well-justified and demonstrates a thoughtful approach to enhancing the robustness of the extraction process.
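A FiLM layer applies a per-channel scale and shift to a feature map, with both predicted from a conditioning vector. The sketch below shows that operation in NumPy, treating the conditioning vector as a stand-in for the DOA-MPE feature. The linear projections, shapes, and names are assumptions for illustration; the paper's conditioning branch may be structured differently.

```python
import numpy as np

def film(features: np.ndarray, cond: np.ndarray,
         w_gamma: np.ndarray, b_gamma: np.ndarray,
         w_beta: np.ndarray, b_beta: np.ndarray) -> np.ndarray:
    """Feature-wise linear modulation: gamma(cond) * features + beta(cond).

    features: (C, T) feature map; cond: (D,) conditioning vector.
    w_gamma, w_beta: (C, D) projection weights; b_gamma, b_beta: (C,) biases.
    """
    gamma = w_gamma @ cond + b_gamma   # per-channel scale
    beta = w_beta @ cond + b_beta      # per-channel shift
    return gamma[:, None] * features + beta[:, None]

# Usage with hand-picked weights so the result is easy to verify.
feats = np.ones((2, 3))
cond = np.array([1.0, 2.0])
out = film(feats, cond,
           w_gamma=np.zeros((2, 2)), b_gamma=np.array([2.0, 3.0]),
           w_beta=np.eye(2), b_beta=np.zeros(2))
```

Because the scale and shift depend only on the conditioning vector, the same filter weights can be reused across array geometries while the modulation adapts to each one.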
The experimental setup is comprehensive, utilizing various microphone array configurations (circular, uniform linear, and random) to evaluate the performance of the GC-SSF. The results indicate that the proposed method consistently outperforms baseline systems in terms of generalization across mismatched geometries, with clear metrics provided (PESQ and SI-SDR) to support the claims. The sensitivity analysis regarding target DOA errors further strengthens the findings, showcasing the model's robustness and spatial selectivity.
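SI-SDR, one of the two reported metrics, projects the estimate onto the reference and compares the energy of that projection with the energy of the residual, which makes the score insensitive to overall gain. The sketch below follows the standard definition; the toy signals are chosen only so the result is easy to check by hand.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return float(10 * np.log10((np.dot(target, target) + eps) /
                               (np.dot(noise, noise) + eps)))

# Usage: reference plus an orthogonal error at one hundredth of its energy.
ref = np.array([1.0, -1.0, 1.0, -1.0])
err = 0.1 * np.array([1.0, 1.0, -1.0, -1.0])
value = si_sdr(ref + err, ref)
```

Rescaling the estimate by any positive constant leaves the value essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.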
The paper provides sufficient details regarding the experimental setup, including datasets, network architecture, and training procedures, which would allow for reproducibility of the results. However, the absence of a public code repository or demo URL limits the ease of access for other researchers to validate and build upon this work.
One identified limitation is that the current architecture is designed for a fixed number of microphones, which may restrict its applicability in more dynamic or ad-hoc acoustic environments. Additionally, while the results are promising, the paper does not explore the potential impact of varying environmental conditions beyond the simulated scenarios.
The advancements presented in this paper have significant implications for real-world applications in acoustic signal processing, particularly in environments where microphone configurations may vary. The ability to generalize across different geometries could enhance the performance of speaker extraction systems in various settings, such as conference rooms, public spaces, and smart devices.
The conventional normalized subband p-norm (NSPN) algorithm achieves robustness in $α$-stable noise ($1<α\leq 2$) by utilizing low-order error moments. However, its performance degrades significantly under three scenarios: (1) non-Gaussian inputs, (2) $α$-stable noise with $0<α\leq 1$, and (3) sparse system identification. To address these limitations, this paper proposes a fractional-order NSPN algorithm based on the nearest Kronecker product (NKP) decomposition and fractional-order stochastic gradient descent, termed NKP-FoNSPN. Theoretical bounds for the fractional-order parameter $β$ are also derived. Notably, when $β=1$, the NKP-FoNSPN reduces to a new NKP-NSPN algorithm, while its non-NKP decomposition variant becomes the fractional-order NSPN (FoNSPN) algorithm. Furthermore, a novel transformation-based NKP (TNKP) decomposition technique is designed, which exhibits lower computational complexity than conventional NKP for specific filter structures. The resulting TNKP-based FoNSPN (TNKP-FoNSPN) achieves lower steady-state misadjustment and multiplication cost compared with the NKP-FoNSPN algorithm. Additionally, complete computational complexity analyses are provided. For active noise control (ANC) scenarios, we develop filtered-x variants: NKP-FxFoNSPN and TNKP-FxFoNSPN. From the former, two additional variants are derived: NKP-FxNSPN and FxFoNSPN. Simulations using diverse noise sources (pink, helicopter, gunshot, pile driver, and traction substation noise) demonstrate the superiority of the proposed algorithms. Finally, we validate their noise reduction performance in a physically constructed single-channel duct ANC system and a simulated multi-channel ANC system.
Primary: Southwest Jiaotong University
All Institutions: Southwest Jiaotong University, Ministry of Education, School of Electrical Engineering, Key Laboratory of Magnetic Suspension Technology and Maglev Vehicle
The main contribution of this paper is the development of a fractional-order subband p-norm adaptive filter that effectively addresses the limitations of existing algorithms in active noise control scenarios. This work significantly advances the state-of-the-art in adaptive filtering by introducing innovative methodologies and demonstrating their effectiveness through rigorous experimentation.
The paper introduces a novel fractional-order normalized subband p-norm adaptive filter (NKP-FoNSPN) that leverages the nearest Kronecker product decomposition and fractional-order stochastic gradient descent. The methodology is well-structured, addressing specific limitations of existing algorithms in handling non-Gaussian inputs and sparse system identification. The introduction of the transformation-based nearest Kronecker product decomposition (TNKP) technique is particularly noteworthy, as it reduces computational complexity while enhancing performance. The theoretical bounds for the fractional-order parameter are derived, which adds rigor to the proposed approach. The paper effectively combines theoretical insights with practical algorithm development, making it a significant contribution to adaptive filtering in noise control scenarios.
The experimental evaluation is comprehensive, utilizing various noise sources (pink, helicopter, gunshot, pile driver, and traction substation noise) to validate the proposed algorithms. The simulations demonstrate the superiority of the NKP-FoNSPN and TNKP-FoNSPN algorithms over existing methods, particularly in challenging noise environments. The performance metrics used, such as normalized mean-square deviation (NMSD), are appropriate for the context, and the results are well-presented, showing clear advantages in convergence rates and steady-state misadjustment.
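The NMSD metric mentioned above is commonly defined as the squared error between the true and estimated filter weights, normalized by the energy of the true weights and expressed in dB. The sketch below uses that common definition; whether the paper averages over trials or uses a different normalization is not stated in this digest, so treat the exact form as an assumption.

```python
import numpy as np

def nmsd_db(w_true: np.ndarray, w_est: np.ndarray) -> float:
    """Normalized mean-square deviation in dB: 10*log10(||w_true - w_est||^2 / ||w_true||^2)."""
    error_energy = np.sum((w_true - w_est) ** 2)
    return float(10 * np.log10(error_energy / np.sum(w_true ** 2)))

# Usage: a sparse unknown system whose single active tap is estimated with 10% error.
w_o = np.array([1.0, 0.0, 0.0, 0.0])
w_hat = np.array([0.9, 0.0, 0.0, 0.0])
deviation = nmsd_db(w_o, w_hat)
```

Lower (more negative) values indicate that the adaptive filter has converged closer to the true system.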
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or links to datasets used in the experiments. While the methodology is described in detail, the absence of a demo or project URL limits the ability of other researchers to replicate the findings directly.
One limitation is the lack of real-world applicability testing beyond the constructed single-channel duct ANC and simulated multi-channel ANC systems. The performance in diverse and uncontrolled environments remains to be validated. Additionally, while the paper addresses several scenarios, it may not cover all potential edge cases in adaptive filtering, particularly with more complex noise profiles.
The proposed algorithms have significant implications for active noise control applications, particularly in environments where non-Gaussian noise is prevalent. The advancements in adaptive filtering techniques can enhance various fields, including audio processing, telecommunications, and environmental noise management. The integration of fractional-order calculus into adaptive filtering may inspire further research into novel approaches for handling complex signal processing challenges.
Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.
Primary: Southeast University
All Institutions: Southeast University, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Purple Mountain Laboratories, Engineering Research Center of Blockchain Application
The main contribution of this paper is the introduction of MusicDET, a zero-shot AI-generated music detection framework that utilizes frequency-guided normalizing flows to model real music distributions, achieving state-of-the-art performance in cross-generator evaluations. This work represents a significant advancement in the field of audio detection, addressing a critical need for reliable methods to distinguish between human-created and AI-generated music.
The methodology presented in MusicDET is innovative, leveraging frequency-guided normalizing flows to model the distribution of real music features for zero-shot detection of AI-generated music. This approach is particularly noteworthy as it circumvents the need for training on generated samples, which is a significant limitation in existing methods. The use of a probabilistic framework allows for effective detection of out-of-distribution samples, and the detailed design of frequency-wise decomposition and band-wise normalizing flows demonstrates a deep understanding of the complexities of musical data.
The experimental evaluation is thorough, utilizing two benchmark datasets (FakeMusicCaps and SONICS) to validate the effectiveness of MusicDET. The results indicate that it consistently outperforms conventional discriminative detectors, particularly in cross-generator scenarios. The use of Equal Error Rate (EER) as a metric is appropriate for the task, and the paper provides a comprehensive analysis of the results, including comparisons with state-of-the-art methods. The experiments also include ablation studies that enhance the understanding of the model's performance and robustness.
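For readers unfamiliar with the metric, the following is a minimal sketch of how an Equal Error Rate can be approximated from detector scores. It is not the paper's evaluation code: the function name `equal_error_rate` and the convention that higher scores mean "more likely real" are assumptions made for illustration.

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumed convention: higher score = more likely real music.
    A sample is accepted as real when score >= threshold.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: real music scored below the threshold.
        frr = sum(s < t for s in real_scores) / len(real_scores)
        # False acceptance rate: generated music scored at or above it.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # EER is read off where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
real = [0.9, 0.8, 0.3, 0.7]
fake = [0.1, 0.2, 0.75, 0.4]
eer = equal_error_rate(real, fake)
```

On a finite score set the two error curves may not cross exactly, so this sketch reports the midpoint at the closest threshold; interpolating variants are also common.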
The paper provides sufficient implementation details, including the architecture of the model, training procedures, and evaluation metrics. The authors mention the use of a GitHub repository for code access, which supports reproducibility. However, the reliance on specific hardware (NVIDIA RTX 4090) may limit accessibility for some researchers.
One limitation is the potential for MusicDET to struggle with robustness against audio manipulations, as indicated in the experiments. Additionally, while the zero-shot approach is a significant advancement, it may not cover all practical scenarios, especially as generative models evolve. The paper also acknowledges the need for further research into robustness against adversarial attacks and real-world post-processing.
The work has significant implications for the music industry, particularly in protecting artistic integrity and addressing copyright issues associated with AI-generated content. By providing a reliable detection method, MusicDET could help mitigate the risks of misuse of generative music technologies, fostering a more equitable music ecosystem. The research also opens avenues for future work in audio authenticity and anomaly detection across various domains.
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
Primary: Wuhan University
All Institutions: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
The paper presents a significant advancement in speaker-specific deepfake detection through the innovative use of phoneme-level profiling, offering a robust and interpretable framework that outperforms existing methods.
The proposed Phoneme-based Voice Profiling (PVP) framework introduces a novel approach to deepfake detection by focusing on phoneme-level analysis rather than macro-utterance assessments. This shift allows for capturing speaker-specific articulatory patterns through lightweight Gaussian Mixture Models (GMMs), enhancing interpretability and robustness against unseen spoofing attacks. The methodology is well-structured, combining phoneme-level consistency scoring with global speaker identity modeling, which is a significant advancement over traditional black-box models.
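The phoneme-level scoring idea can be illustrated with a small sketch. This is not the PVP implementation: it assumes scalar features, hand-written mixture parameters instead of GMMs fitted on reference speech, and a plain mean of log-likelihoods as the consistency score. The names `gmm_log_likelihood`, `profile_score`, and the toy `profile` are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of scalar x under a 1-D Gaussian mixture.

    Uses the log-sum-exp trick for numerical stability.
    """
    log_terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def profile_score(observations, profile):
    """Mean log-likelihood over (phoneme, feature) pairs.

    Phonemes that are absent from the speaker profile are skipped.
    """
    scores = [
        gmm_log_likelihood(x, *profile[ph])
        for ph, x in observations
        if ph in profile
    ]
    return sum(scores) / len(scores) if scores else float("-inf")


# Hypothetical profile: phoneme -> (weights, means, variances). Toy values.
profile = {
    "a": ([0.6, 0.4], [1.0, 2.0], [0.1, 0.2]),
    "i": ([1.0], [-1.0], [0.3]),
}
genuine_like = [("a", 1.1), ("i", -0.9), ("a", 1.9)]
spoof_like = [("a", 3.5), ("i", 1.0), ("a", -0.5)]
```

Because each phoneme contributes its own log-likelihood, the per-phoneme terms can be inspected individually, which is the kind of fine-grained interpretability the review highlights.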
The experimental evaluation is robust, utilizing both a newly created Chinese POI deepfake dataset and the Famous Figures dataset to benchmark the proposed method. The results demonstrate substantial improvements in detection performance, with significant reductions in Equal Error Rate (EER) compared to state-of-the-art methods. The ablation studies further validate the importance of each methodological component, reinforcing the effectiveness of phoneme-level profiling.
The paper provides sufficient implementation details, including model configurations and evaluation metrics, which enhances reproducibility. The availability of the code and dataset on GitHub supports further research and validation of the findings.
While the framework shows promising results, it may still be limited by the diversity of the training data, particularly in capturing all possible phonetic variations across different speakers and languages. Additionally, the reliance on a small amount of reference speech data may not generalize well to all scenarios.
The implications of this research are significant, particularly in the realms of security and forensic analysis, where accurate detection of deepfake audio can prevent misinformation and protect public figures. The interpretability of the model also opens avenues for its application in legal contexts, where understanding the rationale behind detection decisions is crucial.
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096$\times$ temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
Primary: Stability AI
All Institutions: Stability AI
The paper introduces SAME, a stereo music and general-audio autoencoder that achieves a remarkable 4096× temporal compression ratio while maintaining sound quality and generative performance. The comprehensive analysis of the technical contributions, innovative methodology, and significant implications for the field highlight the paper's relevance and potential impact in advancing audio generative models.
The paper presents a novel architecture for audio autoencoding, termed SAME (Semantically-Aligned Music autoEncoder), which integrates a transformer-based backbone with semantic regularization, phase-aware reconstruction losses, and improved discriminator designs. The methodology is well-structured, employing a combination of innovative techniques such as query-based transformer resampling and a soft-normalization bottleneck, which collectively enhance the model's generative capabilities while achieving a high compression ratio. The use of auxiliary losses to shape the latent space for downstream tasks is particularly noteworthy, as it demonstrates a thoughtful approach to improving generative performance without relying on traditional VAE formulations.
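To give a sense of scale for the 4096× temporal compression ratio, the arithmetic can be sketched as below. The 44.1 kHz sample rate and the round-up handling of the final partial frame are assumptions for illustration and are not taken from the paper; `latent_rate_hz` and `latent_frames` are hypothetical helper names.

```python
def latent_rate_hz(sample_rate_hz, compression=4096):
    """Latent frames per second of audio at a given temporal compression."""
    return sample_rate_hz / compression


def latent_frames(duration_s, sample_rate_hz, compression=4096):
    """Latent frames for a clip, rounding up to cover the last partial frame."""
    n_samples = round(duration_s * sample_rate_hz)
    return -(-n_samples // compression)  # ceiling division


# 44.1 kHz is an assumed sample rate, not a figure from the paper.
rate = latent_rate_hz(44_100)       # roughly 10.8 latent frames per second
frames = latent_frames(180, 44_100)  # a three-minute clip
```

Under these assumptions a three-minute track maps to fewer than two thousand latent frames, which is what makes sequence modelling over the latent comparatively cheap.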
The evaluation is thorough, employing both objective metrics (such as SI-SDR and MEL log-magnitude) and subjective assessments via MUSHRA tests. The results indicate that SAME-L outperforms several baselines in terms of audio quality and computational efficiency, which is a significant achievement given the high compression ratio. The inclusion of ablation studies further strengthens the findings by isolating the contributions of various components of the model.
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which should facilitate reproducibility. However, the lack of a publicly available code repository or demo limits the ability for independent verification of results. The authors mention releasing model weights, which is a positive step but does not fully address the reproducibility of the entire system.
One limitation is the reliance on a specific dataset (Audiosparx production music) for training, which may affect the generalizability of the model to other audio domains. Additionally, while the model achieves impressive results, the computational demands of the larger variant (SAME-L) may limit its accessibility for broader applications, particularly in resource-constrained environments.
The advancements presented in this paper have the potential to significantly impact the field of audio processing and generative models. By achieving high compression ratios while maintaining audio quality, the model could facilitate more efficient audio streaming and storage solutions. Furthermore, the integration of semantic alignment in audio generation opens avenues for more contextually aware audio applications, such as music generation that aligns with specific themes or emotions.
The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has made strides in recent years, sonata form analysis remains in its early stages. This is largely because annotating classical music structures is time-consuming and requires substantial musical training. To advance research in this area, we curated SoSA-Moz, the first large-scale dataset featuring comprehensive hierarchical structure annotations. This work establishes a foundation for systematic sonata form analysis. Leveraging this newly contributed resource, we further propose Sonalyzer-Moz, a baseline model specifically designed for investigating complex sonata structures. This framework integrates feature aggregation with sequential modeling, enabling it to capture both local features and upper-level structural dependencies. Experimental results show that Sonalyzer-Moz can identify the boundaries of the upper-level structural components that are critical to understanding sonata form. Therefore, this method demonstrates, for the first time, the effectiveness of automatic upper-level analysis of sonata form, and provides a robust baseline for future research in the automatic understanding of sonata form while advancing the study of classical music structure analysis.
Primary: Monash University Malaysia
All Institutions: Monash University Malaysia, La Trobe University, Monash University
The main contribution of this paper is the introduction of the SoSA-Moz dataset and the Sonalyzer-Moz framework, which together provide a novel approach to analyzing the complex hierarchical structure of Mozart's sonata form using deep learning techniques. This work not only fills a gap in the literature but also sets a foundation for future research in automatic music structure analysis.
The methodology presented in the paper is well-structured, introducing the SoSA-Moz dataset as a foundational resource for sonata form analysis. The Sonalyzer-Moz framework employs a combination of feature aggregation and sequential modeling through CNN and LSTM layers, which is appropriate for capturing the hierarchical nature of sonata form. The integration of dynamic self-similarity matrices and statistical features enhances the model's ability to identify structural boundaries. However, the paper could benefit from a more detailed explanation of the hyperparameter tuning process and the rationale behind the chosen configurations.
The experimental evaluation is robust, with a clear division of the dataset into training, validation, and test sets to prevent data leakage. The paper provides a comprehensive comparison against state-of-the-art methods for popular music, which demonstrates the effectiveness of Sonalyzer-Moz. The reported performance metrics (HR3R, HR3P, HR3F) are relevant and provide insight into the model's capabilities. However, the paper lacks a detailed discussion on the significance of the performance metrics and how they relate to the specific challenges of sonata form analysis.
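The review does not define HR3R, HR3P, and HR3F. Assuming they follow the usual boundary hit-rate convention in music structure analysis (recall, precision, and F-measure with a 3-second tolerance), a minimal sketch looks like this. The function name `boundary_hit_rate` and the toy boundary times are hypothetical, and the simple sorted sweep stands in for the bipartite matching that evaluation toolkits typically use.

```python
def boundary_hit_rate(reference, estimated, window=3.0):
    """Precision, recall and F-measure for boundaries matched within +/- window.

    Each boundary is matched at most once. Both lists are sorted and
    paired with a two-pointer sweep.
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= window:
            hits += 1
            i += 1
            j += 1
        elif ref[i] < est[j]:
            i += 1  # this reference boundary has no estimate close enough
        else:
            j += 1  # this estimated boundary has no reference close enough
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Toy boundary times in seconds, invented for illustration.
reference = [10.0, 50.0, 120.0]
estimated = [12.0, 47.0, 90.0, 125.5]
precision, recall, f_measure = boundary_hit_rate(reference, estimated)
```

Here two of the three reference boundaries are recovered, while two of the four estimates are spurious or too far off, so precision and recall diverge. This is why reporting all three numbers is informative for long movements with few upper-level boundaries.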
The implementation details are adequately described, including the use of specific hardware and software configurations. The availability of the dataset and code as open-source contributes positively to reproducibility. However, the paper could enhance reproducibility by providing more explicit instructions for setting up the environment and running the experiments.
One notable limitation is the reliance on a single composer (Mozart) for the dataset, which may limit the generalizability of the findings to other composers or styles of classical music. Additionally, the model's performance, while competitive, still leaves room for improvement, particularly in capturing the nuances of the sonata form. The authors acknowledge that the current model may not fully exploit the potential of deep learning architectures for this specific domain.
The work has the potential to significantly advance the field of music structure analysis, particularly in classical music. By providing a large-scale dataset and a baseline model, it opens avenues for further research and development of more sophisticated models that could enhance music education, music generation, and music recommendation systems. The focus on sonata form analysis may also inspire interdisciplinary collaborations between musicology and machine learning.
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
Primary: Stability AI
All Institutions: Stability AI
Stable Audio 3 represents a significant advancement in audio generation technology, combining innovative methodologies with practical applications for both consumers and professionals. The paper's contributions to variable-length audio generation, inpainting capabilities, and efficient model training are poised to impact the field of machine learning and audio synthesis significantly.
The methodology presented in Stable Audio 3 is robust and innovative, particularly in its approach to variable-length audio generation using latent diffusion models. The introduction of a semantic-acoustic autoencoder, which allows for efficient audio representation while preserving fidelity and semantic structure, is a significant advancement in audio generation. The use of adversarial post-training to enhance inference speed and output quality is also noteworthy, as it addresses a critical challenge in generative models. The paper effectively combines multiple techniques, including flow matching, distillation, and adversarial training, to create a comprehensive training pipeline that enhances both the quality and efficiency of audio generation.
The paper provides a thorough evaluation of the models against existing state-of-the-art systems, demonstrating significant improvements in audio generation quality and inference speed. The experiments are well-structured, showcasing the models' capabilities in generating variable-length audio and performing inpainting tasks. However, specific quantitative metrics and user studies could further substantiate the claims of superior performance, particularly in subjective evaluations of audio quality.
The authors have made the model weights for the small and medium versions available, along with the training and inference pipeline, which is a positive step towards reproducibility. However, the paper could benefit from more detailed implementation instructions and hyperparameter settings to facilitate easier replication of results by other researchers.
One limitation of the study is the reliance on licensed and Creative Commons data, which may restrict the diversity of the audio used for training. Additionally, while the models are designed for consumer-grade hardware, the computational requirements for the larger models may still be prohibitive for some users. The paper also does not address potential biases in the training data that could affect the generated outputs.
The implications of Stable Audio 3 are significant for various applications, including music production, sound design, and interactive media. By enabling high-quality audio generation on consumer hardware, the model democratizes access to advanced audio synthesis tools, potentially fostering creativity and innovation in the audio domain. The ability to perform targeted audio editing through inpainting could also enhance workflows in professional audio production.
Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimensional paralinguistic metadata, encompassing US-Std, US-CS, and US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, an LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7) with inter-rater reliability confirmed by a 0.68 Cohen's Kappa, validating our curation pipeline's 97.6% confidence score. The corpus maintains a 60-40 gender balance across 71,792 utterances. Our work represents a significant leap toward linguistic inclusivity in global AI. The corpus and code are open-sourced, and a demo page is available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The main contribution of this paper is the introduction of the UrduSpeech corpus, a comprehensive resource for Urdu speech technology that includes high-fidelity audio and extensive paralinguistic annotations. This work significantly enhances the landscape of speech resources for under-resourced languages, addressing critical gaps in existing datasets and methodologies.
The methodology is robust, leveraging a multi-stage pipeline for data curation that addresses the unique challenges of Urdu speech, including RTL script and code-switching. The authors employed advanced techniques such as speaker diarization and noise removal, ensuring high-quality audio segments. The integration of 12-dimensional paralinguistic annotations is a significant enhancement, allowing for detailed analysis of emotional and vocal characteristics. The use of generative models for transcription and annotation, along with a rigorous human-centric validation framework, further strengthens the methodology.
The experiments are comprehensive, with a clear focus on establishing a baseline for Urdu ASR and TTS. The authors conducted a pilot study and a thorough evaluation of various transcription models, providing detailed comparisons and insights into their performance. The Mean Opinion Score (MOS) and inter-rater reliability metrics demonstrate the corpus's high fidelity and reliability, which are crucial for future research and applications.
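As a reminder of what the inter-rater reliability figure measures, here is a minimal sketch of unweighted Cohen's kappa for two raters. It is not the authors' analysis script; the function name `cohens_kappa` and the toy ratings are assumptions, and weighted variants that are sometimes preferred for ordinal scores such as MOS are not shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy 5-point ratings from two hypothetical annotators.
rater_a = [5, 5, 4, 4, 5, 3, 4, 5, 5, 4]
rater_b = [5, 4, 4, 4, 5, 3, 5, 5, 5, 4]
kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters agree on 8 of 10 items, but kappa is noticeably lower than 0.8 because a good share of that agreement would be expected by chance when most ratings cluster at 4 and 5.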
The paper outlines the data collection and preprocessing steps in detail, which aids in reproducibility. However, the absence of a publicly accessible project URL limits the ability for others to directly replicate the study. The authors mention open-sourcing the corpus and code, which is a positive aspect for reproducibility.
The paper acknowledges several limitations, including potential over-segmentation in speaker diarization and the presence of background noise in some audio segments. While the authors have made efforts to validate the gender distribution and speaker IDs, ongoing work is needed to ensure absolute compliance. The reliance on automated systems for initial processing may introduce errors that require manual correction.
The UrduSpeech corpus represents a significant advancement in the field of speech technology for under-resourced languages, particularly Urdu. By providing a high-quality, diverse dataset, this work has the potential to enhance the performance of ASR and TTS systems for Urdu and related dialects, fostering linguistic inclusivity in AI applications. The integration of paralinguistic metadata opens new avenues for research in affective computing and speaker profiling.
Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
Primary: Institute of Radiology, University Hospital Erlangen
All Institutions: Institute of Radiology, University Hospital Erlangen, Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Institut für Informationsverarbeitung, Leibniz Universität Hannover, Department of Radiology, Harvard Medical School and Massachusetts General Hospital
The paper introduces SIREM, a novel speech-informed MRI reconstruction framework that leverages synchronized audio to enhance real-time imaging of speech production. This work represents a meaningful advancement in multimodal imaging techniques, combining audio and MRI data to improve reconstruction quality and efficiency, thereby addressing critical challenges in the field of speech science and clinical assessment.
The proposed SIREM framework innovatively combines synchronized audio with MRI reconstruction to address the challenges of real-time magnetic resonance imaging of speech. By modeling the reconstruction as a fusion of audio-driven and MRI-driven components, the methodology effectively leverages the correlation between vocal tract configurations and produced acoustics. The introduction of a learnable soft weighting profile over spiral arms adds a differentiable mechanism for optimizing k-space sampling, which is a significant advancement in the field. However, the reliance on a fixed segmentation-derived explained-by-audio map limits the flexibility of the model.
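The fusion step can be pictured with a small sketch. It assumes the spatial weighting map acts as a pixel-wise convex combination of the two branches, which is one natural reading of the description but is not confirmed by the review; `fuse_frame` and the toy 2×2 arrays are hypothetical.

```python
def fuse_frame(audio_pred, mri_recon, weight_map):
    """Pixel-wise blend: w * audio + (1 - w) * mri, with w in [0, 1].

    weight_map plays the role of the explained-by-audio map: 1 means the
    pixel is taken entirely from the audio branch, 0 from the MRI branch.
    """
    fused = []
    for a_row, m_row, w_row in zip(audio_pred, mri_recon, weight_map):
        row = []
        for a, m, w in zip(a_row, m_row, w_row):
            if not 0.0 <= w <= 1.0:
                raise ValueError("weights must lie in [0, 1]")
            row.append(w * a + (1.0 - w) * m)
        fused.append(row)
    return fused


# Toy 2x2 frame: audio branch predicts ones, MRI branch predicts zeros.
audio_pred = [[1.0, 1.0], [1.0, 1.0]]
mri_recon = [[0.0, 0.0], [0.0, 0.0]]
weight_map = [[1.0, 0.5], [0.25, 0.0]]
fused = fuse_frame(audio_pred, mri_recon, weight_map)
```

With a fixed map, as the review notes, the blend cannot adapt per frame; a learned map would replace the constant `weight_map` with a network output.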
The experiments are well-structured, utilizing the USC speech rtMRI benchmark and comparing SIREM against established baselines such as gridding, wavelet-based compressed sensing, and total variation. The evaluation metrics are comprehensive, covering both distortion-based and perceptual measures, which provide a thorough assessment of reconstruction quality. While SIREM does not uniformly outperform classical methods, it demonstrates the utility of synchronized speech as a prior, achieving notable improvements in certain metrics.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which enhances reproducibility. The availability of the source code on GitHub further supports this aspect, allowing other researchers to replicate the study and build upon the proposed method.
Key limitations include the use of a fixed explained-by-audio map, which may not capture the full variability of the audio signal's predictive power across different anatomical regions. Additionally, the evaluation is based on a relatively small dataset, which may affect the generalizability of the results. Future work should explore learned fusion maps and prospective sampling strategies to enhance the model's adaptability.
The SIREM framework has significant potential applications in speech science and clinical assessment, particularly in improving the efficiency and quality of rtMRI for speech production analysis. By reducing scan times and enhancing reconstruction fidelity, this method could facilitate more effective clinical evaluations and research into speech disorders. The integration of multimodal data also opens avenues for further exploration in related fields such as audio-visual speech synthesis and real-time imaging technologies.
Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue becomes harder as real and generated audio are increasingly mixed in training, and generated samples do not always match their intended semantic labels. Prior work mainly addressed unreliable supervision from missing-positive labels, while this paper targets three other sources of unreliable supervision: spurious additions, misassignments between similar classes, and weakened label evidence. These effects introduce class-dependent optimisation bias that is not explicitly modeled by most existing methods. To bridge this gap, the paper proposes a Class-wise Supervision Unreliability (CSU) framework that controls supervision strength at the class level during training. CSU learns a separate unreliability parameter for each class and down-weights less reliable supervision without changing the model architecture or inference process. To support evaluations, this paper also introduces ESC-FreeGen50, a manually verified benchmark of 50 sound classes that combines real and generated audio. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across different architectures and different sources of supervision unreliability. The results indicate that explicit class-wise modeling of supervision unreliability is an effective and practical strategy for robust audio tagging under large-scale weakly labeled training. Code and data are available at: https://github.com/Yuanbo2020/CSU
Primary: University of Oxford
All Institutions: University of Oxford, KU Leuven, Harbin Engineering University, KTH Royal Institute of Technology, University of Surrey
The main contribution of this paper is the introduction of the Class-wise Supervision Unreliability (CSU) framework, which effectively addresses the challenges posed by unreliable supervision in audio tagging tasks. The comprehensive evaluation and the introduction of a new benchmark dataset significantly advance the state of the art in robust audio tagging methodologies.
The paper introduces the Class-wise Supervision Unreliability (CSU) framework, which innovatively addresses the problem of unreliable supervision in audio tagging by learning separate unreliability parameters for each class. This approach allows for dynamic down-weighting of less reliable supervision without altering the model architecture or inference process. The methodology is well-structured, addressing three specific types of supervision unreliability (spurious additions, misassignments, and weakened label evidence) and providing a clear rationale for the need for class-wise control mechanisms. The incorporation of a new benchmark dataset, ESC-FreeGen50, further enhances the methodology by allowing controlled evaluations of the proposed framework.
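To make the idea of class-level down-weighting concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes one learnable logit per class, maps it to an unreliability value u in (0, 1), and scales that class's binary cross-entropy by (1 - u). The function names and the `penalty` term, which keeps u from trivially collapsing to 1, are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(prob, label, eps=1e-7):
    # Clamp the probability to avoid log(0).
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def class_weighted_loss(probs, labels, unreliability_logits, penalty=0.0):
    """Mean over classes of (1 - u_c) * BCE_c + penalty * u_c.

    u_c = sigmoid(logit_c) is an assumed per-class unreliability value;
    the penalty term is an illustrative regulariser, not taken from the paper.
    """
    total = 0.0
    for prob, label, logit in zip(probs, labels, unreliability_logits):
        u = sigmoid(logit)
        total += (1.0 - u) * binary_cross_entropy(prob, label) + penalty * u
    return total / len(probs)
```

In this sketch, raising the logit of a class whose labels are noisy shrinks that class's contribution to the loss, while the model architecture and the inference path stay untouched, which matches the property the review highlights.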
The experiments are comprehensive, utilizing both the newly introduced ESC-FreeGen50 dataset and the well-established AudioSet for validation. The results demonstrate that CSU significantly improves robustness across various architectures and types of supervision unreliability. The evaluation metrics used, including mean Average Precision (mAP) and F1-score, are appropriate for the audio tagging task. The paper effectively shows the performance gains of CSU over baseline models and other robust learning methods, providing strong empirical support for the proposed framework.
The paper provides sufficient details regarding the implementation of the CSU framework and the experimental setup, including model architectures, training procedures, and evaluation metrics. However, the lack of a demo URL or direct access to the experimental results may hinder full reproducibility for external researchers. Nonetheless, the availability of the code and dataset on GitHub is a positive aspect for reproducibility.
While the paper presents a robust framework, it does not thoroughly explore the potential limitations of the CSU approach, such as the impact of varying the number of classes or the generalizability of the learned unreliability parameters across different datasets. Additionally, the reliance on manually verified labels for the ESC-FreeGen50 dataset may limit its scalability and applicability to larger, less curated datasets.
The proposed CSU framework has significant implications for the field of audio tagging and weakly supervised learning, particularly in real-world applications where annotation quality is often inconsistent. By improving robustness against unreliable supervision, this work can enhance the performance of audio tagging systems in various domains, including environmental sound recognition and multimedia content analysis. The introduction of the ESC-FreeGen50 dataset also provides a valuable resource for future research in this area.
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music. Our experiments using the MAESTRO dataset showed that our method attained state-of-the-art performance in onset detection. We confirmed the versatility of the OT loss in application to existing models.
Primary: Graduate School of Informatics, Kyoto University
All Institutions: Graduate School of Informatics, Kyoto University, Graduate School of Engineering, Kyoto University, Independent Researcher, Hong Kong
This paper presents a significant advancement in automatic piano transcription by introducing an optimal transport framework that enhances the model's ability to handle temporal misalignments. The combination of a novel loss function and a well-designed neural architecture positions this work as a meaningful contribution to the field of machine learning in music.
The paper introduces a novel approach to automatic piano transcription (APT) by framing it as an optimal transport (OT) problem rather than a traditional frame-level multi-label classification task. This shift is significant as it allows for more flexible handling of temporal misalignments in note predictions. The proposed convolutional recurrent neural network (CRNN) architecture, SFT-CRNN, incorporates a harmonics-aware attention mechanism, enhancing its ability to model spectro-temporal dependencies. The methodology is well-structured, with a clear explanation of the OT loss function and its application to the APT task, making it accessible for replication and further exploration.
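The tolerance to temporal misalignment can be illustrated with a simplified one-dimensional case. The sketch below is an assumption-laden simplification: the paper transports mass over both time and frequency, whereas this example considers a single pitch and uses the closed form for the 1D Wasserstein-1 distance (the accumulated absolute gap between the two cumulative distributions). The function name and the `frame_sec` parameter are hypothetical.

```python
def wasserstein_1d(pred, target, frame_sec=1.0):
    """1D Wasserstein-1 distance between two onset histograms on a shared time grid.

    Both inputs are non-negative activations per frame; each is normalised to
    unit mass before comparison.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    pred_mass, target_mass = sum(pred), sum(target)
    if pred_mass <= 0 or target_mass <= 0:
        raise ValueError("both histograms need positive total mass")

    cdf_gap = 0.0  # running difference between the two cumulative distributions
    cost = 0.0
    for p, t in zip(pred, target):
        cdf_gap += p / pred_mass - t / target_mass
        cost += abs(cdf_gap) * frame_sec
    return cost
```

Under this measure, a predicted onset that is one frame late costs 1 and one that is three frames late costs 3, whereas a frame-wise binary loss would penalise both errors identically.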
The experiments are robust, utilizing the MAESTRO dataset, which is a well-regarded benchmark in the field. The authors report state-of-the-art results in onset detection, achieving an F1-score of 98.36%, which demonstrates the effectiveness of their approach. The comparative study against established baselines and the ablation studies provide strong evidence for the contributions of the OT loss and the model architecture. The evaluation metrics used (precision, recall, F1-score) are appropriate for the task, and the results are presented clearly.
While the paper provides a detailed description of the model architecture and training procedures, it lacks specific implementation details such as code availability or links to a repository. This omission may hinder reproducibility, as other researchers would need to rely solely on the descriptions provided to replicate the results.
One limitation noted in the paper is the model's performance on offset detection, which does not exceed the best-performing systems. The authors attribute this to the absence of a dedicated sustain pedal detection module, indicating a potential area for future work. Additionally, the reliance on a specific dataset (MAESTRO) may limit the generalizability of the results to other musical contexts.
The proposed method has the potential to significantly impact the field of music information retrieval, particularly in applications requiring accurate transcription of musical performances. The ability to handle temporal misalignments could improve the usability of APT systems in real-world scenarios, such as music education, automated accompaniment, and music analysis tools. Furthermore, the model's adaptability to other tasks within music information retrieval suggests broader applicability.
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Each item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean, or Chinese, and evaluates models across (1) single, (2) source separation-based two-stage, and (3) end-to-end cocktail party settings under controlled SNRs. Evaluating two closed-source and four open-weight LALMs, we find that strong single performance does not ensure robust selective auditory attention: cocktail party accuracy degrades under severe SNRs, and errors are dominated by distractor-grounded source confusion. In addition, separation reduces acoustic overlap but leaves source attribution unresolved, often yielding confident wrong-stream answers. Data and code will be released upon publication.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
The main contribution of this paper is the introduction of the MUSA benchmark, which evaluates the selective auditory attention of LALMs in multilingual contexts, revealing critical insights into their performance limitations. This work significantly advances the understanding of LALMs' capabilities and highlights the importance of robust auditory attention mechanisms in real-world applications.
The paper introduces a novel benchmark, MUSA, designed to evaluate the selective auditory attention capabilities of Large Audio Language Models (LALMs) in the presence of multilingual distractors. The methodology is well-structured, employing a cocktail party paradigm that mimics real-world scenarios where multiple languages may interfere with audio processing. The authors rigorously define the experimental settings, including single, separation-based, and cocktail party conditions, and provide a detailed diagnostic error taxonomy that categorizes model failures. This structured approach allows for a comprehensive understanding of the models' performance under varying signal-to-noise ratios (SNRs), which is a significant advancement in the field.
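Mixing a target and a distractor at a controlled SNR is commonly done by rescaling the distractor so that the power ratio hits the requested value. The sketch below shows that standard recipe; whether MUSA uses exactly this procedure is not stated in the summary above, and the function name is hypothetical.

```python
import math


def mix_at_snr(target, distractor, snr_db):
    """Return (mixture, gain) where the distractor is scaled to the requested SNR.

    SNR is defined here as 10 * log10(target_power / scaled_distractor_power).
    """
    if len(target) != len(distractor):
        raise ValueError("signals must have the same length")
    target_power = sum(x * x for x in target) / len(target)
    distractor_power = sum(x * x for x in distractor) / len(distractor)
    if target_power == 0 or distractor_power == 0:
        raise ValueError("both signals need non-zero power")

    # Solve target_power / (gain**2 * distractor_power) = 10 ** (snr_db / 10).
    gain = math.sqrt(target_power / (distractor_power * 10 ** (snr_db / 10)))
    mixture = [t + gain * d for t, d in zip(target, distractor)]
    return mixture, gain
```

Lower `snr_db` values make the distractor louder relative to the target, which corresponds to the severe-SNR conditions under which the reviewed results report degraded accuracy.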
The experiments are robust, involving six different LALMs evaluated across multiple settings and SNR levels. The results clearly demonstrate that high performance in single-stream conditions does not translate to effective performance in cocktail party scenarios, highlighting a critical gap in current LALM capabilities. The authors provide detailed statistical analyses and error distributions, which enhance the validity of their findings. The use of synthesized audio ensures consistency, although it may limit ecological validity.
The paper mentions that data and code will be released upon publication, which is a positive aspect for reproducibility. However, the specifics of the implementation details, such as the exact configurations of the models and the separation techniques used, could be better documented to facilitate independent replication of the results.
The study is limited by the relatively small dataset of 200 synthesized cases, which may not capture the full variability of natural speech. Additionally, the focus on English as the target language and the use of a single off-the-shelf separator may restrict the generalizability of the findings. The authors also acknowledge potential confounding factors that could influence cross-lingual understanding, which are not fully controlled.
The findings have significant implications for the deployment of LALMs in high-stakes environments such as healthcare and aviation, where accurate audio understanding is critical. By addressing the challenges of multilingual interference, this research paves the way for more reliable audio processing systems that can operate effectively in diverse linguistic contexts. The introduction of MUSA as a benchmark could stimulate further research in this area, leading to advancements in model architectures and training methodologies.
High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coarse, track-level annotations. To address these limitations under constrained data and computing resources, we propose S2Accompanist, a Semantic-Aware and Structure-Guided Diffusion Model developed for the ICME2026 ATTM Grand Challenge. Specifically, we design an automated data pipeline comprising structural segmentation, Large Audio-Language Model driven segment-level captioning, and dual-metric quality grading to overcome the absence of localized metadata in raw datasets. Furthermore, we propose a semantic-aware Variational Autoencoder fine-tuning strategy that explicitly distills foundational LeadSheet structures into the acoustic latent space, effectively improving the overall audio fidelity. Extensive experiments demonstrate that S2Accompanist achieves state-of-the-art objective performance on the ATTM Grand Challenge benchmark across both the Efficiency and Performance Tracks. With only 402M parameters, our model remains competitive compared to larger-scale unconstrained models and secured first place in the Efficiency Track.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, WeNet Open Source Community
S2Accompanist presents a significant advancement in music accompaniment generation through its innovative data pipeline and semantic-aware modeling techniques. The comprehensive evaluation against benchmarks demonstrates its effectiveness and potential impact on the field of machine learning in audio.
The methodology presented in S2Accompanist is robust and innovative, particularly in its automated data pipeline that integrates structural segmentation and semantic captioning. The use of a Large Audio-Language Model (LALM) for generating fine-grained captions is a significant advancement, allowing for better semantic control in music generation. The introduction of a semantic-aware Variational Autoencoder (VAE) fine-tuning strategy is a notable contribution, as it effectively distills musical structures into the latent space, enhancing audio fidelity. The overall architecture, which combines these elements into a diffusion model, is well-structured and addresses the limitations of existing models in generating coherent musical accompaniments.
The experimental evaluation is thorough, with extensive testing against established benchmarks in the ATTM Grand Challenge. The results demonstrate S2Accompanist's superiority in both objective metrics (FAD, CCS) and subjective evaluations (MOS), securing the top position in the Efficiency Track. The use of dual-metric grading for data selection is particularly effective, ensuring high-quality training data. The paper provides clear comparisons with other models, showcasing the competitive performance of S2Accompanist despite its smaller parameter size.
The paper includes sufficient details regarding the training process, model architecture, and evaluation metrics, which aids in reproducibility. However, the lack of a publicly available code repository or demo limits the ability for others to replicate the results directly. Future work could benefit from releasing the model and data pipeline to the community.
One limitation is the reliance on the MTG-Jamendo dataset, which may not encompass the full diversity of musical styles and genres, potentially affecting the generalizability of the model. Additionally, while the model performs well under constrained conditions, its performance in less controlled environments or with more complex musical tasks remains to be tested.
The advancements made in S2Accompanist have significant implications for the field of music generation, particularly in enabling high-fidelity accompaniment generation with limited data resources. This could democratize access to music generation technologies, allowing smaller developers and researchers to create sophisticated models without the need for extensive datasets or computational power. The model's approach to integrating semantic understanding into music generation could also inspire future research in multimodal AI applications.
Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impressions through letter shapes, strokes, layouts, and decorative patterns. However, cross-modal retrieval between onomatopoeic images and general sounds has been largely unexplored. This paper thus introduces a bidirectional retrieval framework between onomatopoeic images and the corresponding sound clips. Instead of directly comparing embeddings extracted from pretrained image and audio encoders, we train modality-specific projection heads that re-align the embeddings for visual onomatopoeia and corresponding sounds. We then construct the Multimodal Image-Audio Onomatopoeia dataset (MIAO), which contains paired onomatopoeic images and sound clips across 50 sound event classes. Experimental results show that the proposed method substantially outperforms a zero-shot baseline using pretrained CLIP and CLAP embeddings. These results demonstrate that adapting pretrained representations enables effective retrieval in both directions: from onomatopoeic images to sounds and from sounds to onomatopoeic images.
Primary: Kyoto University
All Institutions: Kyoto University, Doshisha University
This paper presents a novel approach to cross-modal retrieval between onomatopoeic images and sounds, significantly contributing to the field of audio-visual machine learning. The methodology effectively adapts existing models to a unique dataset, demonstrating the potential for improved retrieval performance in multimedia applications.
The proposed methodology introduces a novel bidirectional retrieval framework that leverages modality-specific projection heads to align embeddings from pretrained image and audio encoders for onomatopoeic images and sounds. This approach effectively addresses the challenge of cross-modal retrieval in a previously unexplored area, demonstrating a thoughtful adaptation of existing models (CLIP and CLAP) to a unique dataset (MIAO). The use of projection heads to refine the embedding space is a significant methodological advancement that enhances retrieval performance.
The experiments conducted are robust, utilizing a well-constructed dataset (MIAO) with a clear evaluation strategy. The results show substantial improvements over a zero-shot baseline, indicating the effectiveness of the proposed method. The evaluation metrics used (mAP, R@k, MRR) are appropriate for the task and provide a comprehensive view of the retrieval performance in both directions. However, the paper could benefit from additional comparisons with more sophisticated baselines or alternative methods to further validate the effectiveness of the proposed approach.
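For readers less familiar with the rank-based metrics named above, the following sketch computes R@k and MRR from similarity scores. It makes the simplifying assumption of exactly one relevant gallery item per query; with several relevant items per class, as is likely in MIAO, mAP is the metric that accounts for all of them. The function names are hypothetical.

```python
def rank_of_match(scores, correct_index):
    """1-based rank of the correct gallery item when sorted by descending score."""
    correct_score = scores[correct_index]
    return 1 + sum(1 for s in scores if s > correct_score)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

The same functions apply to both retrieval directions: images act as queries against a sound gallery in one case, and sounds act as queries against an image gallery in the other.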
The paper provides sufficient detail regarding the methodology, including the dataset construction, experimental setup, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ability for others to directly replicate the results. Providing implementation details or a link to a codebase would significantly enhance reproducibility.
One identified limitation is the reliance on pretrained models, which may not fully capture the nuances of onomatopoeic images. Additionally, the model's performance varies between retrieval directions, suggesting that further investigation into the variability of visual representations is needed. The dataset's size and diversity may also limit generalizability, as the results are based on a specific set of illustrators and sound classes.
The proposed framework has potential applications in multimedia production, particularly in enhancing the efficiency of sound effect selection in visual media like comics and animations. By automating the retrieval process based on visual cues, this work could significantly reduce the manual effort required by creators, leading to more streamlined workflows in multimedia content creation. Furthermore, the insights gained from this research could inspire future studies on cross-modal retrieval and representation learning in other domains.
Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech representations. This mismatch causes TTS models to focus excessively on low-level acoustic textures at the expense of high-level semantic coherence, further exacerbating error accumulation in autoregressive generation. To address this challenge, we propose SemaVoice, a semantic-aware continuous autoregressive framework for high-fidelity zero-shot TTS. SemaVoice introduces a Speech Foundation Model (SFM) guided alignment mechanism that refines continuous speech representations to better capture both local semantic consistency and global structural relationships. These representations condition a patch-wise diffusion head within the autoregressive framework for high-quality speech synthesis. Experimental results on the Seed-TTS benchmark show that SemaVoice achieves an English WER of 1.71% and remains highly competitive with state-of-the-art open-source systems in both objective and subjective evaluations. The effectiveness of SFM guided alignment is further confirmed by significant improvements under varying representation granularities with a fixed information-rate constraint.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Tsinghua University, SenseTime Research
The main contribution of this paper is the introduction of SemaVoice, a semantic-aware continuous autoregressive framework that significantly improves high-fidelity zero-shot text-to-speech synthesis through an innovative SFM-guided alignment mechanism. This work represents a meaningful advancement in the field of speech synthesis, addressing critical limitations in existing models and demonstrating strong experimental results.
The proposed SemaVoice framework introduces a novel SFM-guided alignment mechanism that effectively addresses the mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech representations. This innovative approach enhances the semantic coherence of generated speech while maintaining acoustic fidelity. The use of a continuous autoregressive framework, combined with a patch-wise diffusion head, is a significant advancement over traditional TTS architectures. The methodology is well-structured, with a clear explanation of the components and their interactions, although the complexity may pose challenges for replication.
The experimental evaluation is robust, utilizing a large-scale bilingual dataset of 150K hours for training and thorough testing on the Seed-TTS benchmark. The results demonstrate competitive performance against state-of-the-art systems in both objective (WER, speaker similarity) and subjective (MOS) metrics, indicating the effectiveness of the proposed framework. The ablation studies provide valuable insights into the contributions of key components, reinforcing the importance of the SFM-guided alignment mechanism.
While the paper provides a detailed description of the architecture and training process, it lacks a publicly available implementation or code repository, which hinders reproducibility. The absence of a demo URL also limits practical engagement with the model.
The evaluation is limited to a bilingual dataset, which may restrict the generalizability of the findings. Additionally, the paper acknowledges inherent challenges with sequential inference latency and error accumulation in autoregressive generation, which could impact real-time applications.
The advancements in zero-shot TTS synthesis have significant implications for applications in voice cloning, virtual assistants, and accessibility technologies. By improving the semantic coherence and acoustic fidelity of synthesized speech, SemaVoice could enhance user experience in various domains, including entertainment, education, and communication.
Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-identifiable attributes that may dominate model predictions, which may compromise both diagnosis performance and patient privacy. Furthermore, the acoustic features associated with respiratory disease and speaker identity remain unclear in respiratory disease monitoring. We propose an adversarial learning architecture that disentangles pathology-related acoustic patterns from speaker-identifiable attributes. The framework optimizes two clinically hierarchical tasks: (i) respiratory status classification (stable vs. exacerbated) and (ii) exacerbation type classification (asthma exacerbation vs. COPD exacerbation). Speaker identity is suppressed through gradient reversal-based adversarial training. To enhance clinical interpretability, we employ SHapley Additive exPlanations (SHAP) to quantify the contributions of acoustic features to pathology-related predictions versus speaker identity. On the TACTICAS dataset, our method outperforms the single-task baseline across both tasks. For the respiratory status task (stable vs. exacerbated), the AUC improves from 0.897 to 0.910. For the exacerbation type task (asthma exacerbation vs. COPD exacerbation), the AUC increases from 0.674 to 0.793. Concurrently, the J-ratio decreases, confirming effective suppression of speaker information. SHAP analysis reveals the contributions of the acoustic features to both tasks. External validation on the Bridge2AI-Voice dataset further demonstrates consistent performance improvement and reduced speaker dependency, confirming cross-dataset generalizability.
Primary: Maastricht University
All Institutions: Maastricht University, Maastricht University Medical Centre, NUTRIM Research Institute of Nutrition and Translational Research in Metabolism
The main contribution of this paper is the development of a multi-task adversarial learning framework that enhances the accuracy of speech-based monitoring for asthma and COPD exacerbations while preserving patient privacy. This work represents a significant step forward in the intersection of machine learning, healthcare, and privacy, providing a foundation for future research and applications in remote health monitoring.
The proposed methodology utilizes an innovative adversarial learning framework that effectively disentangles speaker-identifiable attributes from pathology-related acoustic features. This approach is well-justified, addressing critical issues of privacy and model generalizability in speech-based monitoring of respiratory diseases. The use of gradient reversal for adversarial training is a solid choice, and the integration of SHAP for interpretability adds significant value to the methodology. However, the paper could benefit from a more detailed explanation of the hyperparameter tuning process and the rationale behind the choice of specific features.
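The gradient reversal mechanism mentioned above can be summarised in a few lines. The sketch below is framework-free and purely illustrative: in practice the layer would be written as a custom autograd function in a deep learning library, and the class name and the `lam` scaling factor are assumptions rather than details taken from the paper.

```python
class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the adversarial signal (hypothetical name)

    def forward(self, features):
        # The speaker classifier sees the encoder features unchanged.
        return list(features)

    def backward(self, grad_from_speaker_head):
        # The encoder receives the opposite gradient, so updating it makes the
        # features less useful for identifying the speaker.
        return [-self.lam * g for g in grad_from_speaker_head]
```

Placed between the shared encoder and the speaker-identity head, such a layer lets the speaker head keep minimising its own loss while the encoder is pushed in the opposite direction, which is how speaker information is suppressed alongside the two pathology tasks.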
The experiments are robust, utilizing two distinct datasets (TACTICAS and Bridge2AI-Voice) to validate the model's performance. The reported improvements in AUC scores across both tasks indicate a significant enhancement in diagnostic accuracy. The use of the J-ratio to measure speaker information leakage is a novel contribution that strengthens the findings. However, the paper could improve by providing more detailed statistical analyses and comparisons with baseline models to better contextualize the results.
While the paper outlines the methodology and datasets used, it lacks sufficient implementation details that would allow for full reproducibility. Key aspects such as the specific configurations of the model architecture, training procedures, and data preprocessing steps are not thoroughly documented. Providing access to code or supplementary materials would greatly enhance reproducibility.
The study is limited by its focus on Dutch speakers, which may affect the generalizability of the findings to other languages and dialects. Additionally, the model's performance on a wider range of respiratory conditions beyond asthma and COPD is not explored, which could limit its applicability in clinical settings. The reliance on specific acoustic features may also overlook other potentially relevant indicators of respiratory health.
This research has significant implications for the development of non-invasive, privacy-preserving monitoring systems for chronic respiratory diseases. By improving diagnostic accuracy while safeguarding patient identity, the framework could facilitate wider adoption of speech-based health monitoring technologies in clinical practice. The findings could also inspire further research into adversarial learning applications in healthcare, particularly in areas where patient privacy is a concern.
Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation tasks such as text-to-audio, text-to-music, and text-to-speech. A key component of latent diffusion is a variational autoencoder (VAE) that compresses high-dimensional signals into a low-frame-rate continuous representation that is conducive to downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
Primary: Adobe Research
All Institutions: Adobe Research
The main contribution of this paper is the introduction of target-KL regularization for training continuous VAEs at fixed bitrates, enabling systematic comparisons with discrete audio codecs and enhancing the understanding of the compression-reconstruction trade-off in audio generation tasks. This work represents a meaningful advancement in the field of audio machine learning, with potential applications in various generative audio tasks.
The proposed method of target-KL regularization is a significant advancement in the training of continuous VAEs for audio generation. By systematically addressing the trade-off between compression and reconstruction quality, the authors provide a novel framework that allows for targeted bitrate control during training. This approach not only enhances the understanding of latent representations in VAEs but also facilitates direct comparisons with discrete audio codecs, which is a notable contribution to the field. The methodology is well-structured, with clear definitions and a solid theoretical foundation linking compression theory to VAE training.
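To make the link between a KL term and a bitrate concrete, here is a minimal sketch, not the authors' implementation. It assumes a diagonal-Gaussian posterior against a standard-normal prior, and it assumes the penalty is the absolute gap between the measured and target KL; the function names and the penalty form are illustrative only.

```python
import math
import numpy as np

def kl_per_frame(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)) in nats, summed over latent channels.

    mu, logvar: float arrays of shape (frames, channels).
    """
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def kl_to_bitrate(kl_nats_per_frame, frame_rate_hz):
    """Convert a mean per-frame KL in nats to bits per second."""
    return kl_nats_per_frame / math.log(2.0) * frame_rate_hz

def target_kl_penalty(mu, logvar, target_bps, frame_rate_hz):
    """Assumed penalty form: |mean KL - target KL|, both in nats per frame."""
    target_nats = target_bps / frame_rate_hz * math.log(2.0)
    mean_kl = float(np.mean(kl_per_frame(mu, logvar)))
    return abs(mean_kl - target_nats)
```

Under these assumptions, sweeping `target_bps` and recording reconstruction error at each setting would trace out a rate-distortion curve of the kind the review describes.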
The experiments conducted are thorough and well-documented, utilizing a variety of datasets and architectures to evaluate the performance of the proposed DAC-VAE models. The results demonstrate a clear advantage of the target-KL regularization in achieving optimal compression rates for different audio tasks, including text-to-sound and text-to-speech generation. The use of rate-distortion curves to visualize the performance of various models is particularly effective in illustrating the benefits of the proposed method. However, the paper could benefit from more extensive qualitative evaluations and comparisons with a wider range of existing models.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for others to replicate the results directly. Including such resources would significantly enhance the reproducibility of the findings.
One limitation of the study is the reliance on proprietary datasets, which may restrict the generalizability of the results. Additionally, while the authors discuss the trade-offs involved in compression rates, there is limited exploration of how these findings might apply to other audio generation tasks beyond those tested. The qualitative aspects of generated audio, such as naturalness and emotional expressiveness, could also be further investigated.
The implications of this research are significant for the audio generation community, particularly in applications involving text-to-audio synthesis and music generation. By providing a framework for systematically studying the trade-offs in audio compression, this work could lead to advancements in the development of more efficient and higher-quality generative audio models. The findings may also influence future research directions in multimodal audio applications and the integration of audio generation with other machine learning tasks.
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.
Primary: Cornell University
All Institutions: Cornell University, TogetherAI, Stanford University
This paper presents a critical examination of voice cloning technologies, revealing that they often apply style transformations rather than faithfully reproducing individual voices. The findings underscore the need for greater awareness and regulation of voice cloning technologies to mitigate potential risks to personal identity and societal norms.
The methodology employed in this study is robust, utilizing a diverse participant pool and a systematic approach to evaluate voice cloning systems. The authors effectively use paired audio samples for human annotation, which allows for a direct comparison between source and cloned voices. The use of multiple TTS models and the inclusion of ablation studies to explore the effects of clip duration and generation settings provide a comprehensive understanding of the phenomena observed. However, the reliance on subjective human ratings introduces potential biases that could affect the results.
The experiments are well-structured, with a clear focus on evaluating the perceived qualities of cloned voices compared to their sources. The statistical significance of the findings is appropriately reported, and the use of various metrics to assess human perception adds depth to the analysis. The findings regarding the homogenization of speaker characteristics and the behavioral implications of voice cloning are particularly noteworthy. However, the paper could benefit from more extensive quantitative analysis alongside the qualitative assessments.
The authors provide sufficient detail regarding their experimental setup, including participant demographics, data collection methods, and the models used for voice cloning. The availability of datasets and code on GitHub enhances reproducibility. However, the paper lacks explicit details on the training processes and hyperparameters used for the TTS models, which could hinder full replication of the results.
The study acknowledges several limitations, including the potential biases in human ratings and the lack of demographic diversity in the participant pool. Additionally, the focus on a limited number of TTS models may not fully capture the variability across different voice cloning technologies. The implications of voice cloning on identity and cultural representation are significant but require further exploration in future work.
This research has substantial implications for the development and deployment of voice cloning technologies. The findings raise important ethical questions regarding identity preservation, trust in synthetic voices, and the potential for misuse in sensitive contexts. The study highlights the need for transparency in how voice cloning systems operate and the societal impacts they may have, particularly in terms of cultural homogenization and the reinforcement of existing biases in voice perception.
Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing. We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency. Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.
Primary: Stony Brook University
All Institutions: Stony Brook University, Northeastern University, University of Illinois Chicago, Discovery Partners Institute
The main contribution of this paper is its comprehensive survey of audio super-resolution and bandwidth extension techniques, providing a structured taxonomy and critical evaluation of existing methodologies. This work serves as a valuable resource for researchers seeking to understand the evolution of the field and the current state of generative modeling approaches.
The paper presents a comprehensive survey of audio super-resolution (SR) and bandwidth extension (BWE), effectively categorizing existing methodologies into discriminative and generative models. It critically evaluates the limitations of traditional deterministic approaches and highlights the advantages of generative frameworks, such as GANs and diffusion models. The authors provide a structured taxonomy that clarifies the relationship between BWE and SR, which is a significant contribution to the field. However, while the survey is thorough, it lacks original experimental results or novel methodologies that could further enhance its impact.
The paper does not present original experiments or results but instead synthesizes existing literature and methodologies. It reviews various datasets and evaluation metrics commonly used in the field, including subjective and objective measures. The lack of new experimental validation limits the paper's technical impact, as it primarily serves as a literature review rather than presenting novel findings.
As a survey paper, reproducibility is not directly applicable; however, the authors do provide a clear overview of existing methodologies and their evaluation metrics. The absence of new experimental results means there are no implementation details to reproduce, which is a common limitation in survey papers.
The primary limitation of this paper is its lack of original experimental contributions or novel methodologies. While it provides a comprehensive overview, it does not advance the field with new insights or findings. Additionally, the survey may not cover the most recent developments if they emerged after the paper's submission.
The survey has significant implications for researchers in audio processing, as it provides a structured overview of the evolution of BWE and SR techniques. By highlighting the shift towards generative models, it may guide future research directions and inspire the development of new methodologies. The discussion of emerging trends, such as the integration of large language models, indicates potential avenues for future exploration in multimodal audio systems.
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, without revealing which musical aspects are dominant in that influence. We propose ARIA, a framework that decomposes attribution along musical aspects (five for symbolic music, three for audio) and pairs the decomposition with reliability diagnostics computed from the segment-level score matrix. It measures within-group similarity among the top-K attributed tracks against random reference groups drawn from the training pool, and diagnoses the score matrix through its singular value decomposition and column statistics. On a symbolic-music model where attribution ground truth is available through counterfactual retraining, the reliability diagnostics rank four attribution methods identically to that ground truth. On an audio music generation model, ARIA reveals attribution behaviors that vary substantially across TDA methods, flags score matrices whose retrieved tracks are nearly identical across queries rather than reflecting per-query attribution, and characterizes embedding-similarity retrieval baselines by the musical aspect each encoder surfaces. Together, ARIA produces per-aspect attribution evidence aligned with the musical aspects considered under the idea-expression distinction in copyright analysis.
Primary: Chalmers University of Technology
All Institutions: Chalmers University of Technology, University of Gothenburg
The paper presents ARIA, a novel framework for music training data attribution that effectively decomposes influence along musical aspects and provides reliability diagnostics, addressing a critical need in the intersection of machine learning and copyright law.
The proposed ARIA framework innovatively decomposes training data attribution (TDA) along multiple musical aspects, addressing a significant gap in existing methods that reduce influence to a single scalar. The methodology includes reliability diagnostics based on segment-level score matrices and singular value decomposition, which are crucial for understanding the attribution behavior of different methods. This multi-faceted approach is particularly relevant in the context of music generation and copyright analysis, as it aligns with the legal framework of idea-expression distinction.
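The two reliability diagnostics mentioned above can be illustrated with a short sketch. It is an assumption-laden example rather than ARIA's actual code: cosine similarity over per-track feature vectors stands in for the aspect-specific similarity, and the share of squared singular-value mass in the first singular value stands in for the SVD diagnostic. All function names are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(features):
    """Mean cosine similarity over all distinct pairs of rows."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(features)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

def within_group_gap(features, top_k_idx, n_ref=100, seed=0):
    """Similarity of the top-K group minus the mean over random reference groups."""
    rng = np.random.default_rng(seed)
    k = len(top_k_idx)
    ref = [
        mean_pairwise_cosine(features[rng.choice(len(features), k, replace=False)])
        for _ in range(n_ref)
    ]
    return mean_pairwise_cosine(features[top_k_idx]) - float(np.mean(ref))

def top_singular_energy(scores):
    """Share of squared singular-value mass held by the first singular value."""
    s = np.linalg.svd(scores, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))
```

In this sketch, a `top_singular_energy` close to 1 means the score matrix is nearly rank one, which is the situation in which every query retrieves almost the same tracks; a positive `within_group_gap` means the attributed tracks resemble each other more than randomly drawn training tracks do.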
The experiments conducted on both symbolic and audio music generation models are well-structured, utilizing a benchmark with ground truth for validation and exploring the performance of various attribution methods. The results demonstrate the effectiveness of ARIA in revealing the influence of training songs on generated outputs and highlight the variability of attribution behaviors across different methods. The use of statistical measures to assess within-group similarity adds robustness to the findings.
The paper provides comprehensive details on the experimental setup, including model architectures, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
One limitation is the reliance on existing benchmarks and the challenges associated with creating ground truth for audio attribution, which may affect the generalizability of the findings. Additionally, the framework's performance may vary with different types of music or genres, which is not fully explored in the experiments.
The implications of this research extend to the legal domain, particularly in copyright analysis, as it provides a framework for understanding the influence of training data on generative models. This could aid in developing fair compensation mechanisms for artists and inform future regulations regarding AI-generated content. The framework also sets a foundation for further research in music generation and attribution, potentially influencing how generative models are evaluated and utilized in practice.
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification headers: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
Primary: Zhejiang University
All Institutions: Zhejiang University, Zhejiang Provincial Natural Science Foundation, National Natural Science Foundation of China
The main contribution of this work is the introduction of ToxiAlert-Bench, a comprehensive dataset for paralinguistic-aware toxic speech detection, and a dual-head neural network that significantly improves detection performance by integrating both textual and paralinguistic features. This paper represents a meaningful advancement in the field of audio-based machine learning, addressing a critical gap in existing research and providing a robust framework for future studies.
The paper introduces a novel dual-head neural network architecture designed specifically for detecting toxic speech by leveraging both textual and paralinguistic cues. The methodology is well-structured, involving a multi-stage training strategy that effectively reduces task interference and addresses data imbalance through class-balanced sampling and weighted loss functions. The dataset, ToxiAlert-Bench, is comprehensive, comprising over 30,000 audio clips with detailed annotations that allow for nuanced analysis of toxicity sources. The use of both real and synthesized audio samples enhances the dataset's robustness and diversity.
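As a rough illustration of the imbalance handling described here, the sketch below derives inverse-frequency class weights and uses them for class-balanced sampling. The paper's exact weighting scheme is not stated in this review, so the formula `n / (k * count)` and both function names are assumptions.

```python
from collections import Counter
import random

def class_weights(labels):
    """Inverse-frequency weights, scaled so a balanced dataset gives 1.0 per class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def balanced_sample(labels, num_draws, seed=0):
    """Draw example indices so that every class is picked with equal probability."""
    w = class_weights(labels)
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=[w[y] for y in labels], k=num_draws)
```

The same per-class weights could be passed to a weighted cross-entropy loss, so that errors on rare toxic categories count for more than errors on frequent ones.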
The experiments are thorough, comparing the proposed method against several state-of-the-art baselines. The results demonstrate significant improvements in detection performance, particularly in identifying toxicity conveyed through paralinguistic cues. The paper provides detailed metrics, including accuracy and Macro-F1 scores, which support the claims of the model's effectiveness. The ablation studies further validate the contributions of the model's components, reinforcing the robustness of the findings.
The authors have taken steps to ensure reproducibility by documenting the dataset construction process and providing a GitHub repository for the model. However, the paper could benefit from more detailed implementation specifics, such as hyperparameter settings and training protocols, to facilitate easier replication by other researchers.
One limitation is the reliance on the quality of the synthetic data generated, which may not fully capture the complexity of real-world toxic speech. Additionally, while the dataset is extensive, the focus on English may limit the applicability of the findings to other languages and cultural contexts. The paper does not address potential biases in the dataset or the model's performance across different demographics.
This research has significant implications for online communication platforms, particularly in enhancing moderation systems for audio content. By addressing the nuances of toxic speech that are often overlooked in text-based moderation, the findings could lead to more effective tools for preventing harassment and promoting safer online environments. The dataset and model could serve as foundational resources for future research in audio-based toxicity detection.
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
Primary: Central Conservatory of Music
All Institutions: Central Conservatory of Music, Zhipu AI
The main contribution of this paper is the introduction of BandTok, a novel 2D Mel-spectrogram tokenizer that enhances autoregressive music generation through improved token independence and reconstruction fidelity. This work significantly advances the field by addressing limitations of existing tokenization methods and providing a robust framework for future research in audio generation.
The paper presents BandTok, a novel 2D Mel-spectrogram tokenizer specifically designed for autoregressive music generation. The methodology is well-structured, focusing on improving token independence and reducing error propagation through a shared codebook of Mel-frequency band tokens. The use of a multi-scale PatchGAN discriminator and EMA codebook updates enhances reconstruction fidelity, while the introduction of 2D Rotary Position Embedding (RoPE) effectively preserves the temporal and frequency-band structure during generation. The approach is innovative, leveraging a unique tokenization strategy that contrasts with traditional residual multi-codebook methods.
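One common way to build a 2D rotary embedding is to split each attention head's channels in half and rotate one half by the time index and the other by the frequency-band index. The sketch below shows that construction; whether BandTok's language model uses this exact channel layout is an assumption, and `rope_1d` and `rope_2d` are illustrative names. The head dimension must be divisible by 4 and inputs should be floating point.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles pos * frequency."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, t, band):
    """First half of the channels encodes time index t, second half the band index."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], t), rope_1d(x[..., half:], band)], axis=-1
    )
```

With this layout, the dot product between a rotated query and a rotated key depends only on the differences in time and band position, which is the property that lets attention respect the token grid's two axes.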
The experiments are comprehensive, comparing BandTok against existing tokenizers and evaluating both reconstruction quality and generation performance. The use of objective metrics like FAD and CLAP scores, alongside subjective assessments, provides a robust evaluation framework. The results indicate that BandTok outperforms residual-codebook tokenizers, demonstrating its effectiveness in a data-limited setting. However, the paper could benefit from more extensive ablation studies to isolate the impact of each component of the proposed method.
The paper provides sufficient implementation details, including training configurations, datasets, and evaluation metrics, which should facilitate reproducibility. The source code and generation demos are publicly available, further supporting the reproducibility of the results. However, the lack of a clear description of the datasets used for training and evaluation could pose challenges for researchers attempting to replicate the study.
One limitation is the reliance on specific datasets, which may affect the generalizability of the results. The paper also does not address potential biases in the training data, which could influence the quality of generated music. Additionally, while the proposed method shows improvements over existing approaches, the paper does not explore the scalability of BandTok with larger datasets or more complex music generation tasks.
The proposed method has significant implications for the field of music generation, particularly in enhancing the quality and fidelity of generated audio. By improving tokenization strategies, BandTok could facilitate advancements in various applications, including music composition, sound design, and interactive audio systems. The integration of multimodal aspects, such as text conditioning, opens avenues for more sophisticated music generation frameworks that could benefit artists and content creators.
Generative models can address difficult problems with non-unique solutions, such as bandwidth extension, gap filling, and the removal of highly non-linear artifacts from codecs, clipping, and distortion, as opposed to removing linear additive components like noise and reverb. While large offline processing models have shown impressive results, these tasks have not been solved with real-time capable models with low latency and compute. We propose a few-step flow matching model using Data Prediction Mean Flows in combination with a suitable novel low-latency architecture to make flow matching models an attractive choice under these constraints. Compared to the state of the art, our proposed mean flow model uses 120x less compute and introduces no algorithmic latency other than the STFT, while achieving similar audio quality.
Primary: Microsoft Research
All Institutions: Microsoft Research
This work presents a significant advancement in real-time speech restoration using generative models, demonstrating a 120x reduction in computational complexity while maintaining audio quality. The combination of innovative methodologies and thorough experimental validation positions this research as a notable contribution to the field of machine learning and audio processing.
The paper introduces a novel few-step flow matching model utilizing Data Prediction Mean Flows (DP-MF) for real-time speech restoration. The methodology is well-structured, addressing the limitations of existing generative models in terms of latency and computational efficiency. The combination of innovative training techniques, such as the introduction of a data prediction loss and the careful design of flow time distributions, demonstrates a significant advancement in the field. The architecture is designed to minimize latency while maximizing audio quality, which is critical for real-time applications.
The experiments are comprehensive, utilizing a large-scale dataset that simulates real-world audio degradation scenarios. The evaluation metrics include both subjective (MOS) and objective (WER, DNSMOS SIG) measures, which provide a balanced view of the model's performance. The results indicate that the proposed model achieves audio quality similar to existing state-of-the-art models while significantly reducing computational requirements, showcasing the effectiveness of the proposed approach.
The paper provides sufficient details regarding the architecture, training data, and evaluation metrics, which would allow for reproducibility. However, the absence of a public code repository limits accessibility for other researchers wishing to replicate or build upon this work.
While the proposed model shows substantial improvements in latency and computational efficiency, there are still gaps in performance compared to non-causal models, particularly in terms of WER. Additionally, the reliance on specific training data and augmentation techniques may limit generalizability to other types of audio restoration tasks.
The advancements made in this paper have significant implications for various applications, including telecommunications, hearing aids, and augmented reality devices. By enabling real-time speech restoration with reduced computational demands, this work could enhance user experiences in environments where audio quality is critical.
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, NVIDIA
The paper presents a comprehensive study on the impact of synthetic conversational data on multi-talker ASR and speaker diarization, revealing critical insights into simulation strategies and their task-dependent effects. The introduction of FastMSS as an open-source toolkit represents a significant advancement in the field, enabling further research and application in multi-talker speech processing.
The paper introduces FastMSS, an open-source simulator that allows for the generation of synthetic multi-talker conversations with configurable parameters. The methodology is robust, systematically varying key factors such as turn-taking dynamics and source domain diversity. The authors provide a clear rationale for their choices and demonstrate the importance of task-specific simulation strategies, which is a significant contribution to the field. The use of two leading models, DiCoW for MT-ASR and Sortformer for SD, adds depth to the analysis, allowing for a comprehensive understanding of how synthetic data can be optimized for different tasks.
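To make the turn-taking factor concrete, the sketch below lays utterances out on a timeline and measures how much of the speech overlaps. It is not the FastMSS API: the function names, the Gaussian gap model, and the example durations are assumptions chosen for illustration, and it does not prevent a speaker from overlapping with their own earlier turn.

```python
import random

def simulate_turns(utterances, mean_gap, gap_std=0.0, seed=0):
    """Lay utterances out on a timeline (hypothetical helper, not FastMSS).

    utterances: list of (speaker, duration_seconds) in speaking order.
    The gap between the latest end time so far and the next start is drawn
    from Normal(mean_gap, gap_std); a negative gap produces overlap.
    Returns a list of (speaker, start, end) tuples.
    """
    rng = random.Random(seed)
    segments = []
    latest_end = 0.0
    for speaker, duration in utterances:
        gap = rng.gauss(mean_gap, gap_std) if segments else 0.0
        start = max(0.0, latest_end + gap)
        end = start + duration
        segments.append((speaker, start, end))
        latest_end = max(latest_end, end)
    return segments

def overlap_ratio(segments):
    """Fraction of speech time during which two or more segments are active."""
    events = []
    for _, start, end in segments:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, a segment end (-1) sorts before a start (+1)
    active, last_t, speech, overlap = 0, 0.0, 0.0, 0.0
    for t, delta in events:
        if active >= 1:
            speech += t - last_t
        if active >= 2:
            overlap += t - last_t
        active += delta
        last_t = t
    return overlap / speech if speech > 0 else 0.0

turns = [("A", 3.0), ("B", 2.0), ("A", 4.0), ("B", 2.5)]
sparse = simulate_turns(turns, mean_gap=0.5)   # pauses between turns
dense = simulate_turns(turns, mean_gap=-0.8)   # interruptions and overlap
print(overlap_ratio(sparse), overlap_ratio(dense))
```

Sweeping `mean_gap` in a setup like this is one simple way to produce training mixtures with different overlap levels, the factor the study reports as helping ASR but hurting diarization.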
The experiments are well-designed, utilizing a variety of datasets that reflect real-world conditions. The results are clearly presented, showing the impact of different simulation strategies on performance metrics such as tcpWER for ASR and DER for diarization. The findings that synthetic data can approach real-data performance and that combining both yields the best results are particularly noteworthy. The paper effectively demonstrates the practical implications of its findings, making it relevant for both academic and industry applications.
The authors emphasize reproducibility by releasing FastMSS as an open-source toolkit, which is a commendable practice in the research community. They provide detailed descriptions of their experimental setup, including datasets and evaluation metrics, which further enhances the reproducibility of their results. However, the reliance on specific configurations and hyperparameters may require careful attention from users to replicate the results exactly.
One limitation noted in the paper is the potential lack of inter-turn semantic coherence in the generated conversations, which could affect the performance of ASR systems. Additionally, while the study covers a range of simulation strategies, the generalizability of the findings to other tasks or domains outside those tested remains uncertain. The paper could also benefit from a more extensive discussion on the ethical implications of using synthetic data in real-world applications.
The research has significant implications for the fields of speech recognition and speaker diarization, particularly in scenarios where real conversational data is scarce. By demonstrating that synthetic data can effectively complement or even substitute real data, this work opens avenues for more efficient training of ASR and diarization systems. The findings could lead to advancements in applications such as virtual assistants, automated meeting transcriptions, and other multi-talker environments.
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.
Primary: Princeton University
All Institutions: Princeton University, Technische Universität München, Rotterdam Eye Hospital, Centre for Tactile Internet with Human-in-the-Loop, Technische Universität Dresden, Munich Center for Machine Learning, Chair for Social Affective Touch
This paper presents a novel real-time sonification framework for enhancing surgical guidance during subretinal injections, demonstrating significant improvements in event identification accuracy through innovative auditory feedback mechanisms. The methodology and experimental results indicate a strong potential for clinical impact, although further validation in diverse surgical contexts is necessary for widespread adoption.
The proposed methodology introduces a structured sonification framework that effectively maps iOCT-derived anatomical features into auditory feedback, leveraging a physics-inspired acoustic model. The approach is well-defined, utilizing real-time updates based on segmented retinal layers and employing a mass-spring-damper system to reflect dynamic interactions during subretinal injections. The integration of both tool-driven and anatomy-driven excitations is innovative, enhancing the auditory feedback's relevance to surgical contexts. However, the reliance on a specific anatomical model may limit generalizability across different surgical scenarios.
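A mass-spring-damper element of the kind mentioned above can be sketched as a single damped oscillator driven by an excitation force. This is only an illustration under assumed parameters (mass, damping, a 440 Hz resonance, an impulse excitation). The paper's actual acoustic model, its topology, and its mapping from retinal layers to physical parameters are not reproduced here.

```python
import math

def stiffness_for_frequency(mass, freq_hz):
    """Spring constant giving an undamped resonance at freq_hz: k = m * (2*pi*f)^2."""
    return mass * (2.0 * math.pi * freq_hz) ** 2

def simulate_mass_spring_damper(mass, stiffness, damping, force, sample_rate):
    """Integrate m*x'' = f - k*x - c*x' with semi-implicit Euler.

    force: excitation samples (e.g. derived from needle motion, an assumption here).
    Returns the displacement trajectory, usable as an audio signal after scaling.
    """
    dt = 1.0 / sample_rate
    x, v = 0.0, 0.0
    out = []
    for f in force:
        a = (f - stiffness * x - damping * v) / mass
        v += a * dt   # update velocity first, then position (semi-implicit Euler)
        x += v * dt
        out.append(x)
    return out

SAMPLE_RATE = 44100
MASS = 1e-3                                     # illustrative value
K = stiffness_for_frequency(MASS, 440.0)        # illustrative resonance
DAMPING = 0.04                                  # decay rate c / (2m) = 20 per second
impulse = [1.0] + [0.0] * (SAMPLE_RATE // 2 - 1)  # single "tap", 0.5 s of output
response = simulate_mass_spring_damper(MASS, K, DAMPING, impulse, SAMPLE_RATE)
```

In a sonification setting, changing the stiffness or damping per anatomical region would change the pitch or decay of the resulting sound, which is the general idea behind using a physical model as the sound source.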
The user study involving 34 participants provides robust evidence of the proposed method's effectiveness, demonstrating significant improvements in event identification accuracy compared to a baseline. The statistical significance of the results (p < 0.001) strengthens the claims of enhanced performance. The qualitative evaluations and feedback from expert surgeons further validate the clinical applicability of the framework. However, additional details on participant demographics and the specific experimental setup would enhance the evaluation's transparency.
The paper provides a GitHub repository link for the code, which is a positive step towards reproducibility. However, the implementation details could be more thoroughly documented to facilitate easier replication by other researchers. The reliance on specific software libraries (e.g., miPhysics) should also be clearly stated to avoid potential compatibility issues.
The study's limitations include a small sample size for expert feedback and the potential for bias in participant selection. The framework's performance in diverse surgical scenarios beyond subretinal injection remains untested. Additionally, the auditory feedback's effectiveness may vary based on individual surgeon preferences and experiences, which could affect its adoption in clinical practice.
The proposed sonification framework has the potential to significantly enhance surgical precision and reduce cognitive load during delicate procedures like subretinal injections. By providing real-time auditory feedback, it could improve patient outcomes and streamline surgical workflows. The approach may also inspire further research into auditory feedback systems in other medical domains, potentially leading to broader applications in minimally invasive surgeries.
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
Primary: University of Oxford
All Institutions: University of Oxford, Australian Institute for Machine Learning, Stanford University, University of Central Florida, University of Surrey
The paper presents AuralSAM2, a novel framework that enhances the Segment Anything Model 2 by integrating audio features for improved promptable segmentation. This work significantly advances the field of audio-visual integration in machine learning, providing a robust methodology and strong experimental results that demonstrate its potential impact on future research and applications.
The methodology introduces AuralFuser, which effectively integrates audio features into the SAM2 framework without modifying its visual backbone. This is achieved through a novel approach that generates both sparse and dense prompts, enhancing the model's ability to leverage audio cues in segmentation tasks. The introduction of an audio-guided contrastive loss (AudioCon) is particularly innovative as it addresses the challenge of visual dominance in the latent space, ensuring that audio signals are prioritized in the learning process. The hierarchical design of the feature pyramid is a significant methodological advancement that preserves audio influence throughout the network.
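The exact form of AudioCon is not given in this summary. As a rough illustration of what an audio-anchored contrastive objective can look like, the sketch below implements a generic InfoNCE-style loss with the audio embedding as the anchor. The function names, the cosine similarity, and the temperature value are assumptions, not the authors' formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def audio_anchored_info_nce(audio, visual_feats, is_positive, temperature=0.1):
    """Generic InfoNCE-style loss with the audio embedding as anchor.

    audio: audio embedding (list of floats).
    visual_feats: visual feature vectors, e.g. one per region (assumption).
    is_positive: booleans marking features that belong to the sounding object.
    """
    logits = [cosine(audio, v) / temperature for v in visual_feats]
    max_logit = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    pos = sum(e for e, p in zip(exps, is_positive) if p)
    if pos == 0.0:
        raise ValueError("at least one positive feature is required")
    return -math.log(pos / sum(exps))

audio_emb = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned = audio_anchored_info_nce(audio_emb, regions, [True, False])
misaligned = audio_anchored_info_nce(audio_emb, regions, [False, True])
```

The loss is small when the features marked as positive are the ones most similar to the audio embedding, and large otherwise, so minimising it pulls audio-relevant visual features toward the audio anchor.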
The experimental evaluation is robust, utilizing two public benchmarks (Ref-AVS and AVSBench) to demonstrate the efficacy of AuralSAM2. The results show significant improvements in segmentation accuracy compared to existing methods, particularly in human-in-the-loop scenarios, which is a critical application area. The ablation studies effectively highlight the contributions of different components of the proposed method, reinforcing the validity of the results.
The paper provides a link to the code repository, which is essential for reproducibility. However, the implementation details could be more comprehensive, particularly regarding the training setup and hyperparameters used. Clearer documentation would enhance the ability of other researchers to replicate the results.
One limitation is the reliance on the SAM2 framework, which may restrict the generalizability of the proposed method to other architectures. Additionally, while the integration of audio is innovative, the paper does not extensively discuss the potential challenges in real-world applications, such as varying audio quality or background noise.
The integration of audio into visual segmentation tasks has significant implications for various applications, including video analysis, surveillance, and human-computer interaction. By improving the accuracy of segmentation in scenarios where audio cues are present, this work could enhance the usability of AI systems in real-world environments, making them more efficient and effective.
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.
Primary: University of New Brunswick
All Institutions: University of New Brunswick, FPT Software, University of Science
The paper presents a contestable multi-agent framework for multimedia verification that integrates multimodal large language models and an arena-based argumentation approach. The methodology is innovative and addresses critical issues in multimedia verification, although empirical validation and detailed experimental results are needed to fully assess its impact.
The proposed methodology is innovative, integrating multimodal large language models with an arena-based quantitative bipolar argumentation framework. The multi-agent approach effectively decomposes multimedia verification tasks into claim-centered sections, allowing for structured argumentation and transparent reasoning. The use of selective clash resolution and uncertainty-aware escalation enhances the system's robustness and practicality for real-world applications.
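The summary does not say which gradual semantics the arena-based QBAF uses. To illustrate how support and attack strengths can be resolved into a single score, the sketch below applies DF-QuAD, one commonly used semantics for quantitative bipolar argumentation. Treating it as representative of the paper's method is an assumption, as are the example scores.

```python
def aggregate(strengths):
    """Probabilistic-sum aggregation: 1 - prod(1 - s) over the given strengths."""
    remaining = 1.0
    for s in strengths:
        remaining *= 1.0 - s
    return 1.0 - remaining

def dfquad_strength(base, supporters, attackers):
    """DF-QuAD final strength of an argument.

    base: intrinsic (base) score in [0, 1].
    supporters / attackers: final strengths of supporting / attacking arguments.
    """
    s = aggregate(supporters)
    a = aggregate(attackers)
    if a >= s:
        return base - base * (a - s)          # attacks dominate: move toward 0
    return base + (1.0 - base) * (s - a)      # supports dominate: move toward 1

# Hypothetical claim with two supporting pieces of evidence and one attack.
claim_score = dfquad_strength(0.5, supporters=[0.8, 0.6], attackers=[0.7])
```

Because each argument's score depends only on its direct supporters and attackers, small local graphs like the ones described above can be evaluated bottom-up, and editing one piece of evidence changes the claim score in a traceable way.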
The paper lacks detailed experimental results or benchmarks that validate the proposed framework's effectiveness. While it describes the methodology in depth, the absence of empirical data or comparisons against existing methods limits the assessment of its performance and impact.
The implementation is publicly available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient details on the datasets used, evaluation metrics, or specific experimental setups, which could hinder full reproducibility.
The paper does not address potential limitations in terms of scalability, the complexity of the argumentation process, or the handling of ambiguous cases. Additionally, the reliance on external verification tools may introduce variability in results based on the quality of those tools.
The framework has significant implications for multimedia verification, particularly in combating misinformation in digital media. Its emphasis on contestability and transparency could enhance trust in automated verification systems, making it a valuable tool for journalists, fact-checkers, and the general public.
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
Primary: Sharif University of Technology
All Institutions: Sharif University of Technology, Independent Researcher
This paper presents the first large-scale dataset of Persian music and successfully adapts a state-of-the-art generative model to this culturally rich domain. The comprehensive methodology and promising results underscore the potential for AI to engage with and celebrate diverse musical traditions.
The methodology is robust, featuring a comprehensive dataset curation process that addresses the significant gap in Persian music resources. The authors employed a sophisticated approach for audio segmentation, tagging, and conditioning using state-of-the-art models. The three-stage training pipeline for adapting MusicGen to Persian music is well-structured, emphasizing unsupervised domain adaptation, instrument-focused fine-tuning, and supervised fine-tuning, which collectively enhance the model's cultural fidelity and stylistic accuracy. However, the reliance on automated tagging and the absence of expert validation for some aspects of the dataset may introduce noise and inaccuracies.
The experimental evaluation is thorough, utilizing both objective metrics (KLD and Chroma Cosine Similarity) and a hybrid evaluation strategy. The results indicate that the fine-tuned model significantly outperforms the baseline in generating culturally coherent Persian music. However, the evaluation could benefit from a more extensive subjective assessment involving trained musicians to capture perceptual qualities that are critical in music generation.
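The two objective metrics named above can be sketched as follows. How the paper computes them is not described here, so the sketch assumes time-averaged 12-bin chroma vectors for the cosine similarity and class-probability vectors (for example from an audio tagger) for the KL divergence. The example values are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two chroma vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) after normalising both inputs to sum to one.

    eps guards against log(0) when q has empty bins.
    """
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical time-averaged chroma profiles (12 pitch classes) of two clips.
reference_chroma = [0.9, 0.1, 0.3, 0.1, 0.7, 0.5, 0.1, 0.8, 0.1, 0.4, 0.1, 0.2]
generated_chroma = [0.8, 0.2, 0.3, 0.1, 0.6, 0.5, 0.2, 0.7, 0.1, 0.5, 0.1, 0.2]
similarity = cosine_similarity(reference_chroma, generated_chroma)
```

A 12-bin chroma vector folds all pitches into equal-tempered semitones, which is one reason a metric like this may miss the microtonal detail noted as a limitation of the evaluation.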
The paper provides a clear description of the dataset creation process and model training, which facilitates reproducibility. However, some details regarding the specific configurations used during training and the exact nature of the evaluation metrics could be elaborated upon to enhance clarity for future researchers attempting to replicate the study.
Key limitations include the dataset's skewed genre distribution towards Persian pop, which may affect the model's generalizability across other Persian music styles. The automatic tagging process may introduce inaccuracies, and the evaluation metrics used do not fully capture the richness of Persian music, particularly in terms of microtonal fidelity and ornamentation. Additionally, the model's performance may be constrained by the smaller variant of MusicGen used for fine-tuning.
This research has significant implications for the field of generative music, particularly in promoting cultural diversity in AI-generated content. By addressing the underrepresentation of Persian music in generative models, this work opens avenues for further exploration of other non-Western musical traditions. The dataset created can serve as a valuable resource for future research in music generation, potentially influencing the development of more culturally-aware AI systems.
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features and ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.
Primary: Kyoto University
All Institutions: Kyoto University, LY Corporation
The main contribution of this paper is the introduction of the TE2SL framework, which enhances text-only domain adaptation in LLM-based ASR by generating expressive pseudo-audio prompts through a learnable refinement module. This work represents a significant advancement in bridging the modality gap in ASR systems, with promising implications for improving performance in data-scarce environments.
The proposed Text-Embedding-to-Speech-Latent (TE2SL) framework innovatively addresses the challenge of text-only domain adaptation in LLM-based ASR by introducing a learnable refinement module that enhances the quality of pseudo-audio prompts. This method effectively bridges the modality gap by ensuring that the synthesized prompts are both sample-dependent and aligned with the characteristics of the audio encoder and projector. The methodology is well-structured, with a clear distinction between training and adaptation phases, and utilizes a Conformer architecture to achieve this refinement. The focus on architecture-aware synthesis is a significant advancement over previous heuristic approaches.
The experiments conducted are thorough, comparing the TE2SL framework against established baselines, including LLM-only fine-tuning and pseudo-audio prompt methods. The results demonstrate substantial improvements in both recognition accuracy and out-of-vocabulary (OOV) recall across multiple datasets in English and Japanese, validating the effectiveness of the proposed method. The use of diverse datasets strengthens the generalizability of the findings, and the metrics employed (WER and CER) are appropriate for evaluating ASR performance.
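For readers less familiar with the metrics, WER and CER are both edit distances normalised by reference length, computed over words and characters respectively. The sketch below is a minimal version. It applies no text normalisation and removes whitespace before computing CER, which are assumptions and may differ from the paper's scoring setup.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace (suits unsegmented text such as Japanese)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

print(wer("the cat sat on the mat", "the cat sat on mat"))
```

CER is the usual choice for Japanese because word boundaries are not marked in the text, so word-level scoring would depend on the tokenizer.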
The paper provides a detailed description of the experimental setup, including model architectures, training configurations, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Clearer documentation or a supplementary material section with implementation details could enhance reproducibility.
One limitation is the reliance on the quality of the audio encoder and projector, which may vary across different languages or domains. Additionally, while the method shows promise in improving OOV recall, the paper does not extensively discuss the implications of these improvements in practical applications. The scalability of the TE2SL framework in low-resource settings, where high-quality audio encoders may not be available, also warrants further exploration.
The proposed approach has significant potential applications in various domains where ASR systems are deployed, particularly in low-resource languages or specialized fields with limited paired data. By improving domain adaptation capabilities, this work can enhance accessibility and usability of ASR technologies in diverse linguistic contexts. The findings could also inform future research on multimodal learning and integration of audio-visual data in ASR systems.
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
Primary: Institute of Engineering, Tribhuvan University
All Institutions: Institute of Engineering, Tribhuvan University
IsoNet presents a novel approach to audio-visual target speech extraction, effectively addressing the limitations of compact microphone arrays in challenging acoustic environments. The combination of advanced methodologies and thorough experimental validation positions this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology in IsoNet is robust, combining multi-channel STFT features, GCC-PHAT spatial cues, and face-conditioned visual embeddings within a U-Net architecture. The use of curriculum learning to progressively introduce SNR challenges is a thoughtful approach that enhances model robustness. The architecture is designed to address specific failure modes of compact microphone arrays, making it relevant for practical applications. The integration of auxiliary direction-of-arrival supervision is a notable addition that helps regularize the learning process.
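The GCC-PHAT cue used as a spatial feature can be sketched as follows: the cross-spectrum of two microphone signals is normalised to unit magnitude so that only phase (and therefore delay) information remains, and the inverse transform peaks at the inter-microphone lag. To stay dependency-free, this sketch uses a naive O(n^2) DFT that is only suitable for short frames, whereas a practical implementation would use an FFT library. The signal values and function names are illustrative assumptions, not IsoNet's feature pipeline.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def gcc_phat(sig, ref, max_lag):
    """Return (best_lag, correlation values for lags -max_lag..max_lag).

    A positive lag means `sig` is a delayed copy of `ref`.
    """
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    sig_f = dft(list(sig) + [0.0] * (n - len(sig)))
    ref_f = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [s * r.conjugate() for s, r in zip(sig_f, ref_f)]
    # PHAT weighting: discard magnitude, keep only the phase of the cross-spectrum.
    phat = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in dft(phat, inverse=True)]
    lags = list(range(-max_lag, max_lag + 1))
    values = [cc[lag % n] for lag in lags]
    best = max(range(len(lags)), key=values.__getitem__)
    return lags[best], values

mic_a = [1.0, -0.5, 0.25, 0.8, -0.3, 0.6, -0.9, 0.1]   # illustrative frame
mic_b = [0.0, 0.0, 0.0] + mic_a                         # same frame, 3 samples later
lag, curve = gcc_phat(mic_b, mic_a, max_lag=5)
```

With an aperture of a few centimetres the true delays span only a few samples at typical sampling rates, so feeding the whole correlation curve over a small lag range to the network, rather than a single peak estimate, is a plausible way to read the "delay-bin encoding" mentioned in the abstract. That reading is an interpretation, not something the summary confirms.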
The experiments are well-structured, utilizing a large dataset of 25,000 simulated mixtures from VoxCeleb, which is appropriate for the task. The evaluation metrics (SI-SDR, PESQ, and STOI) provide a comprehensive view of both objective and perceptual quality. The results demonstrate significant improvements over baseline methods, particularly in challenging SNR conditions. The ablation studies effectively isolate the contributions of different components of the model, providing clear insights into the efficacy of visual and spatial conditioning.
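Since the headline results are reported in SI-SDR and SI-SDR improvement (SI-SDRi), a minimal reference computation may help. The sketch follows the standard definition: remove the means, project the estimate onto the target, and compare the energy of the projection with the energy of the residual. The example signals are made up, and the paper's own evaluation code is not shown here.

```python
import math

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Optimal scaling of the target that best explains the estimate.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    projection = [alpha * b for b in t]
    residual = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

clean = [1.0, -1.0, 1.0, -1.0]
separated = [1.0, -1.0, 1.0, 0.0]
score = si_sdr(separated, clean)
```

A negative SI-SDRi, as reported above for the oracle delay-and-sum and MVDR beamformers, means the processed output is further from the target than the original mixture was.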
The paper provides sufficient detail on the experimental setup, including the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the lack of publicly available code or datasets limits the ability for independent verification of results.
The study primarily focuses on scenarios with a single interfering speaker, which may not fully capture the complexities of real-world environments with multiple speakers and background noise. Additionally, the reliance on simulated data may introduce discrepancies when transitioning to real-world applications. The phase reconstruction method used could also be improved for better performance in low SNR conditions.
The proposed IsoNet system has significant implications for various applications, including voice assistants, hearing aids, and augmented reality devices, where selective listening is crucial. By enhancing the ability to extract target speech in complex acoustic environments, this research could improve user experiences in everyday communication scenarios.
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The main contribution of this paper is the introduction of "Break-the-Beat!", a novel model for controllable MIDI-to-drum audio synthesis that combines advanced conditioning mechanisms with a pre-trained audio generation framework. This work not only fills a crucial gap in the existing literature but also offers practical tools for music producers, enhancing the creative process in digital music production.
The methodology presented in the paper is robust and innovative, leveraging a pre-trained text-to-audio model (SAO) and introducing a dual-input content encoder that effectively combines MIDI and reference audio for drum synthesis. The hybrid conditioning mechanism is a noteworthy contribution, allowing for precise control over both rhythm and timbre. The use of a novel dataset constructed from existing drum audio datasets is a significant step towards addressing the lack of resources in this area. The authors provide a clear overview of their approach, detailing the input representations, conditioning mechanisms, and training strategies, which enhances the clarity and reproducibility of their work.
The experimental evaluation is thorough, utilizing a well-defined dataset and a variety of metrics to assess the performance of the proposed model. The results demonstrate significant improvements in audio quality, rhythmic alignment, and beat continuity, particularly when using higher temporal resolutions for MIDI input. The paper effectively compares its method against various baselines and provides qualitative and quantitative analyses, which strengthen the validity of the findings. However, the paper could benefit from additional user studies or subjective evaluations to further substantiate the claims of improved audio quality.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository limits the ability of other researchers to fully replicate the study. Providing access to the trained models or code would significantly enhance the reproducibility of the results.
One limitation of the study is the reliance on a specific dataset, which may not encompass the full diversity of drum sounds and styles encountered in real-world music production. Additionally, while the model performs well on the evaluated metrics, the subjective quality of generated audio in practical scenarios remains to be fully explored. The paper also does not address potential computational costs associated with training and inference, which could be a barrier for some users.
The proposed model has the potential to significantly impact digital music production by providing a tool that allows for greater control and creativity in drum synthesis. This could democratize music production for non-experts and enhance the workflow of professional producers. Furthermore, the findings could inspire future research in the area of symbolic-to-audio synthesis, particularly for other instrument types and musical styles.
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
Primary: Radboud University
All Institutions: Radboud University, Radboud University Medical Center
This paper presents the first benchmark for speech-based EarlyPD detection, addressing a critical gap in the literature. The comprehensive methodology and robust experimental evaluation provide a significant contribution to the field, encouraging further research and development in clinically meaningful detection methods.
The paper introduces a well-structured benchmark for Early-stage Parkinson's Disease (EarlyPD) detection from speech, addressing the critical issue of comparability in existing research. The methodology includes a speaker-independent split for datasets, a clear definition of EarlyPD, and a multi-dimensional evaluation framework that allows for nuanced comparisons across various factors, such as gender and disease stage. The use of diverse training-resource settings and the inclusion of both public and private datasets enhance the robustness of the proposed benchmark.
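The speaker-independent split described above can be illustrated with a minimal sketch. It assumes each utterance is a dictionary with a `speaker` field, and the function name, `test_fraction`, and `seed` are hypothetical. The benchmark's actual split procedure (for example any stratification by gender or disease stage) is not specified here, so this only shows the core constraint that no speaker appears on both sides.

```python
import random
from collections import defaultdict


def speaker_independent_split(utterances, test_fraction=0.2, seed=0):
    """Split utterances so that every speaker lands in exactly one partition."""
    # Group utterances by speaker ID (the "speaker" key is an assumed field name).
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    # Shuffle speakers, not utterances, so the split is made at the speaker level.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))

    test = [u for s in speakers[:n_test] for u in by_speaker[s]]
    train = [u for s in speakers[n_test:] for u in by_speaker[s]]
    return train, test


# Toy data: 10 speakers with 3 utterances each.
data = [{"speaker": f"spk{i}", "utt": j} for i in range(10) for j in range(3)]
train, test = speaker_independent_split(data)
train_speakers = {u["speaker"] for u in train}
test_speakers = {u["speaker"] for u in test}
```

Splitting at the utterance level instead would let the same voice appear in training and evaluation, which tends to inflate scores for speaker-dependent cues rather than disease-related ones.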
The experiments are comprehensive, utilizing multiple speech tasks and a variety of machine learning models. The results are presented clearly, with a focus on both aggregate and utterance-level performance. The findings indicate significant improvements in EarlyPD detection when expanding speaker diversity, which is a valuable insight for future research. The evaluation metrics used (AUC and F1) are appropriate for the clinical context, ensuring relevance to real-world applications.
The authors emphasize transparency and reproducibility by providing all necessary resources and protocols for replicating their benchmark. The fixed training and evaluation settings, along with the release of datasets, contribute to a high level of reproducibility. However, the reliance on specific datasets may limit generalizability if future datasets differ significantly.
One limitation is the potential bias introduced by the datasets used, particularly in terms of gender representation and the skewed nature of some datasets. Additionally, while the benchmark is robust, the focus on specific speech tasks may not encompass the full range of speech variability seen in real-world clinical settings. The authors also note that spontaneous speech tasks were not included, which could be a significant aspect of EarlyPD detection.
The proposed benchmark has the potential to significantly advance the field of speech-based EarlyPD detection, promoting more reliable and clinically relevant research. By establishing a standardized evaluation protocol, it encourages the adoption of best practices in the community, ultimately leading to improved diagnostic tools for Parkinson's disease. The emphasis on explainability in model design also aligns with current trends in AI, making the findings particularly relevant for future developments in healthcare technology.
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
Primary: ServiceNow
All Institutions: ServiceNow
The main contribution of this paper is the introduction of EVA-Bench, a novel evaluation framework for voice agents that combines realistic simulation with comprehensive metrics to assess performance across various architectures and conditions. This work significantly advances the field by addressing critical evaluation challenges and providing a foundation for future research in voice agent technology.
The methodology presented in this paper is robust and addresses significant gaps in the evaluation of voice agents. The authors introduce an end-to-end framework, EVA-Bench, that combines realistic bot-to-bot audio simulation with comprehensive measurement metrics (EVA-A and EVA-X). The simulation methodology is particularly noteworthy as it incorporates automated validation to ensure the quality of user simulations, which is critical for obtaining reliable evaluation scores. The introduction of controlled perturbations to assess robustness against accent and noise variations further strengthens the methodology, allowing for a nuanced understanding of system performance across different conditions.
The experimental evaluation is thorough, involving 12 systems across three distinct architectures and a total of 213 scenarios. The results reveal critical insights into the performance of voice agents, particularly the divergence between peak and reliable performance, which is a crucial finding for real-world applications. The use of multiple trials and the introduction of pass@1, pass@k, and pass^k metrics provide a comprehensive view of system capabilities, although the paper could benefit from additional comparative analysis against existing benchmarks to contextualize the findings further.
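To make the distinction between peak and reliable capability concrete, the sketch below computes empirical pass@1, pass@k, and pass^k from repeated trials. It assumes each scenario was run exactly k times and that each trial has already been reduced to a boolean pass/fail; the function name and the toy outcomes are hypothetical, and the paper's exact estimator may differ.

```python
from statistics import mean


def pass_metrics(trials):
    """trials: one list of booleans per scenario, one boolean per trial."""
    # pass@1: average per-trial success rate, averaged over scenarios.
    pass_at_1 = mean(mean(1.0 if ok else 0.0 for ok in t) for t in trials)
    # pass@k: fraction of scenarios where at least one of the k trials passed.
    pass_at_k = mean(1.0 if any(t) else 0.0 for t in trials)
    # pass^k: fraction of scenarios where all k trials passed.
    pass_hat_k = mean(1.0 if all(t) else 0.0 for t in trials)
    return pass_at_1, pass_at_k, pass_hat_k


# Toy outcomes: 4 scenarios, 3 trials each (illustrative values only).
outcomes = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [False, True, True],
]
p1, pk, phk = pass_metrics(outcomes)
gap = pk - phk  # peak-versus-reliable gap
```

A large gap between pass@k and pass^k indicates a system that can solve a scenario on some attempts but not consistently, which is the divergence the review highlights.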
The authors emphasize reproducibility by providing open-source access to the framework, evaluation suite, and benchmark data. They include detailed implementation instructions and configurations, which are essential for other researchers to replicate the study. However, the reliance on commercial model APIs for full reproduction may limit accessibility for some researchers, potentially impacting the overall reproducibility of the findings.
The paper acknowledges several limitations, including potential biases in the LLM-based judges, the lack of multilingual coverage, and the constraints of the user simulator in replicating real human caller behaviors. Additionally, the evaluation does not account for harmful outputs or sensitive information exposure, which is particularly relevant in high-stakes domains. The authors also note that the framework does not assess more complex agent configurations, which may limit its applicability in certain scenarios.
The EVA-Bench framework has significant implications for the development and evaluation of voice agents in enterprise applications. By providing a comprehensive evaluation methodology, it can help improve the reliability and user experience of voice agents, ultimately leading to better deployment in real-world settings. The findings regarding performance gaps and robustness under perturbations can inform future research and development efforts, guiding improvements in voice agent architectures and evaluation practices.
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly-labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable fraction of multi-source samples where background interference or overlapping events could limit the usefulness of the data. To address this challenge, we introduce a data curation framework designed for large-scale open audio corpora. Our approach leverages a generative diffusion model to synthesize clean single-class events to construct controlled noisy mixtures for supervision. We subsequently employ a pre-trained audio encoder coupled with a discriminative classifier to automatically identify and filter out multi-source samples. Experiments show that our framework achieves strong performance on a human expert-curated test set. Finally, we release FSD50K-Solo, a model-curated subset of FSD50K containing single-source audio samples identified by our method. Beyond FSD50K, our method establishes a scalable paradigm for curating open source audio corpora.
Primary: Stony Brook University
All Institutions: Stony Brook University, Bose Corporation
This paper introduces a systematic framework for automated curation of single-source sound events, addressing critical data quality challenges in audio machine learning. The innovative use of generative models for dataset enhancement and the strong experimental results position this work as a significant contribution to the field.
The proposed methodology employs a generative diffusion model to synthesize clean single-source audio events, which is a novel approach to address the challenge of multi-source interference in existing datasets. The framework's reliance on a pre-trained audio encoder and a discriminative classifier for filtering multi-source samples is a significant advancement in automated data curation. The systematic approach to generating controlled noisy mixtures for supervision demonstrates a thoughtful integration of generative modeling with traditional classification techniques.
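One way to build controlled mixtures for supervision is to scale an interfering clip to a chosen signal-to-noise ratio before adding it to a clean event. The sketch below illustrates that idea under stated assumptions: parameterizing the mixture by SNR in dB is an assumption (the source only says the mixtures are controlled), the two clips are assumed to have equal length and non-zero energy, and `mix_at_snr` is a hypothetical helper, not the authors' code.

```python
import numpy as np


def mix_at_snr(target, interferer, snr_db):
    """Add `interferer` to `target`, scaled so the target-to-interferer power ratio is snr_db."""
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)
    p_target = np.mean(target**2)
    p_interf = np.mean(interferer**2)
    # Choose the gain so that p_target / (gain^2 * p_interf) equals 10^(snr_db / 10).
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + gain * interferer, gain


# Toy signals standing in for two synthesized single-class events.
t = np.arange(16000) / 16000.0
clean_event = np.sin(2 * np.pi * 440 * t)
other_event = np.random.default_rng(0).normal(size=t.shape)

mixture, gain = mix_at_snr(clean_event, other_event, snr_db=5.0)
achieved_snr = 10 * np.log10(np.mean(clean_event**2) / np.mean((gain * other_event) ** 2))

# Labels for a single-source vs. multi-source classifier (assumed labeling scheme).
examples = [(clean_event, 0), (mixture, 1)]
```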
The experiments are well-structured, utilizing both generated data and a human-curated internal dataset for evaluation. The performance metrics, including traditional classification metrics and Audiobox Aesthetics scores, provide a robust assessment of the model's effectiveness. The results indicate strong classification performance, particularly on the expert-curated dataset, which underscores the model's practical applicability.
The paper states that the complete clip-level metadata of FSD50K-Solo will be released, supporting reproducibility. However, the lack of a direct link to the dataset or code repository limits immediate access for other researchers. The methodology is described in sufficient detail to allow for replication, but the absence of a public project URL is a drawback.
One limitation acknowledged is the potential domain gap between generated data and real-world audio data, which could affect generalization. Additionally, while the framework shows promise, the exploration of its performance on unseen event classes is still required. The reliance on a human-curated dataset for validation may introduce biases inherent in the curation process.
The release of FSD50K-Solo and the proposed curation framework has the potential to significantly advance audio machine learning research by providing a high-quality dataset that can enhance model training and evaluation. The methodology can be applied to other audio corpora, promoting better practices in dataset curation across the field. The implications of improved audio datasets extend to various applications, including sound event detection, audio synthesis, and machine learning in general.
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-cognition ALM processing only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
Primary: Ghent University
All Institutions: Ghent University, Vrije Universiteit Brussel, Queen Mary University of London
The main contribution of this paper is the introduction of NAACA, a training-free neuro-inspired architecture that employs oscillatory dynamics for salience-driven attention gating in audio processing. This innovative approach addresses critical limitations in existing audio language models, offering a promising direction for future research and applications in audio understanding.
The methodology presented in NAACA is innovative, leveraging a neuro-inspired Oscillatory Working Memory (OWM) to address the attention bottleneck in Audio Language Models (ALMs). The approach of framing salience detection as an auditory filtering problem is well-grounded in cognitive neuroscience, and the training-free aspect of the architecture is particularly noteworthy. The use of oscillatory dynamics to maintain stable memory states while adapting to salient changes in audio streams is a significant advancement over traditional methods that rely on extensive historical data or training. The detailed formulation of OWM and its integration into the NAACA framework is technically sound, although the complexity of the model may pose challenges for practical implementation.
The experiments conducted on the XD-Violence and Urban Soundscapes of the World (USoW) datasets provide robust evidence of NAACA's effectiveness. The reported improvement in average precision (AP) demonstrates a clear performance gain over existing models, and the qualitative case studies further illustrate the model's ability to detect salient events in complex audio environments. However, the paper could benefit from a more comprehensive comparison with a wider range of baseline models to fully contextualize its performance.
The paper provides a thorough description of the methods and implementation details, which enhances reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the findings. Including a demo or project URL would greatly enhance the paper's impact and usability within the community.
The primary limitation noted is the dependency on the performance of the chosen audio encoder, which may restrict the model's applicability to out-of-distribution sound events. Additionally, the hard-gating mechanism may overlook contextual information that could be preserved with more flexible attention mechanisms. The evaluation metrics focus mainly on anomaly detection, suggesting that future work should explore broader audio understanding tasks.
The implications of this research are significant, particularly in fields such as public safety surveillance, environmental monitoring, and any domain where audio analysis is critical. By improving the efficiency and effectiveness of audio processing in real-time applications, NAACA has the potential to enhance situational awareness and response capabilities in various contexts.
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA regression and a symbolic rendering baseline, while direct regression remains stronger on phase-sensitive waveform L1. Auxiliary RVQ cross-entropy improves short-step diffusion on mel error, onset-flux cosine, and waveform L1, with the most favorable trade-offs occurring at 6-25 denoising steps depending on the metric.
Primary: Hellenic Mediterranean University
All Institutions: Hellenic Mediterranean University, Athena RC
This paper contributes a significant advancement in symbolic-to-audio drum rendering through a novel latent-diffusion model that preserves event timing and dynamics while synthesizing realistic audio. The comprehensive methodology and robust experimental evaluation position it as a meaningful contribution to the field of machine learning in audio applications.
The paper presents a novel approach to symbolic-to-audio drum rendering using a conditional latent-diffusion model, which aligns symbolic conditioning to physical time and utilizes PCA for dimensionality reduction in the latent space. The methodology is well-structured, incorporating auxiliary RVQ cross-entropy for improved performance and demonstrating a clear pipeline from symbolic input to audio output. The use of PCA coordinates as a denoising target rather than direct waveform samples is innovative and addresses the challenges of maintaining control over the generated audio while ensuring acoustic fidelity.
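The PCA target can be sketched as follows: fit principal components by SVD, keep those above a singular-value threshold, standardize the coordinates, and map predictions back to the latent space with a deterministic linear decode. This is a toy illustration, not the authors' implementation. The relative threshold `1e-6`, the function names, and the 16-dimensional synthetic latents with a 4-dimensional subspace are all assumptions standing in for the 1024-dimensional DAC latents and 72 components reported in the abstract.

```python
import numpy as np


def fit_pca(latents, rel_threshold=1e-6):
    """Fit PCA by SVD, keeping components whose singular value exceeds a relative threshold."""
    mean = latents.mean(axis=0)
    _, s, vt = np.linalg.svd(latents - mean, full_matrices=False)
    rank = int(np.sum(s > rel_threshold * s[0]))
    components = vt[:rank]                          # shape (rank, dim)
    scale = s[:rank] / np.sqrt(len(latents) - 1)    # per-component standard deviation
    return mean, components, scale


def encode(latents, mean, components, scale):
    """Project onto the components and standardize each coordinate to unit variance."""
    return (latents - mean) @ components.T / scale


def decode(coords, mean, components, scale):
    """Deterministic path from standardized coordinates back to the latent space."""
    return (coords * scale) @ components + mean


# Synthetic latents that lie in a 4-dimensional subspace of a 16-dimensional space.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 16))

mean, components, scale = fit_pca(latents)
coords = encode(latents, mean, components, scale)   # what a denoiser would predict
recon = decode(coords, mean, components, scale)
```

Because the retained components span the observed subspace, decoding the standardized coordinates reproduces the training latents up to numerical error, while the denoising target has far fewer dimensions than the latent itself.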
The experimental setup is robust, utilizing a substantial dataset of 11,523 training examples and a variety of evaluation metrics that capture different aspects of audio quality, including spectral fidelity and transient accuracy. The results indicate significant improvements over baseline methods, particularly in spectral and transient metrics, although direct regression outperforms on phase-sensitive waveform metrics. The comprehensive evaluation across multiple configurations and the use of statistical testing to validate findings enhance the credibility of the results.
The paper outlines the training and evaluation processes in detail, including hyperparameters and data preprocessing steps, which supports reproducibility. However, the lack of a public repository at the time of review limits immediate reproducibility. The authors mention plans to release the code, which would further aid in this aspect.
The study is narrow in scope, focusing on short four-beat segments rather than full musical compositions, which may limit the generalizability of the findings. Additionally, the reliance on automatic evaluation metrics without a human listening study raises questions about perceived audio quality. The fixed PCA representation may not be optimal for all contexts, and the evaluation does not account for sampling variability.
The proposed method has significant implications for music technology, particularly in enhancing the controllability and fidelity of drum synthesis in various applications, including music production and interactive audio systems. The approach could inspire further research into symbolic-to-audio translation methods and their integration into broader music generation frameworks.
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
Primary: Adalat AI, India
All Institutions: Adalat AI, India
The paper presents Vividh-ASR, a complexity-tiered benchmark and a novel training strategy (R-MFT) that significantly enhances the performance of ASR systems for low-resource Indic languages. This work is a valuable contribution to the field, addressing critical challenges in adapting multilingual ASR models while preserving their foundational acoustic capabilities.
The paper introduces a systematic factorial design to dissect the effects of learning rate timing and curriculum ordering on ASR performance. The proposed Reverse Multi-Stage Fine-Tuning (R-MFT) method is well-structured, allowing for a clear understanding of how different training strategies impact model adaptation. The complexity-tiered benchmark, Vividh-ASR, is a significant methodological contribution, providing a structured way to evaluate ASR models across varying levels of acoustic complexity.
The experiments are rigorous, employing a controlled factorial design that isolates key variables affecting performance. The results are clearly presented, demonstrating substantial improvements in WER through the proposed methods. The analysis of internal model representations using CKA and SVD adds depth to the evaluation, linking empirical results to theoretical insights about model adaptation.
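For readers unfamiliar with CKA, the sketch below implements the standard linear variant for comparing two sets of activations computed on the same inputs. Which CKA variant the paper uses is not stated here, so linear CKA is an assumption, and the random matrices are placeholders for layer activations from a pre-trained and a fine-tuned model.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_examples, n_features)."""
    # Center each feature so the similarity does not depend on mean offsets.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 8))                 # placeholder "pre-trained" activations

# An orthogonal rotation plus isotropic scaling leaves the representation geometry intact.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
same_geometry = linear_cka(base, 2.0 * base @ q)

# Independent random activations share no structure with `base`.
unrelated = linear_cka(base, rng.normal(size=(200, 8)))
```

A score near 1 between a layer before and after fine-tuning suggests its geometry was preserved, which is the kind of evidence behind the claim that effective schedules leave the encoder largely unchanged.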
The paper provides sufficient details on the implementation, including model architectures, training stages, and hyperparameters, which facilitates reproducibility. However, the lack of a publicly available demo or project URL limits the ease of access to the exact experimental setup.
The study primarily focuses on Hindi and Malayalam, which may limit the generalizability of the findings to other Indic languages or low-resource languages in general. Additionally, while the paper discusses the preservation of the encoder's acoustic geometry, it does not fully explore the implications of this for other model architectures or training paradigms.
The findings have significant implications for improving ASR systems in low-resource languages, potentially enhancing accessibility and usability in diverse linguistic contexts. The introduction of a complexity-tiered benchmark could inspire further research and development in ASR, particularly for languages that have been historically underrepresented in machine learning research.
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).
Primary: unknown
All Institutions: unknown
Text2Score presents a novel two-stage framework for generating sheet music from natural language prompts, significantly advancing the state of symbolic music generation. The methodology effectively separates planning and execution, yielding high-quality outputs that outperform existing models, while the comprehensive evaluation framework sets a new standard for assessing music generation systems.
The methodology presented in Text2Score is innovative, utilizing a two-stage framework that separates the planning and execution phases of music generation. This approach allows for more structured reasoning about musical attributes, which is a significant advancement over traditional end-to-end models. The use of an LLM orchestrator to create a structured measure-wise plan is particularly noteworthy, as it addresses issues related to the lack of aligned text-music datasets. The integration of a generative model that processes this plan through a hierarchical decoder further enhances the robustness of the generation process. The detailed definition of the structural plan and the metrics for evaluation are well-articulated, providing a clear framework for assessing the generated outputs.
The experimental evaluation is thorough, employing both objective metrics and subjective assessments from expert musicians. The paper provides a comprehensive suite of evaluation metrics that cover playability, readability, and prompt adherence, which are crucial for assessing the quality of generated sheet music. The results demonstrate that Text2Score outperforms several baseline models, indicating the effectiveness of the proposed framework. However, the paper could benefit from a more detailed discussion of the dataset's diversity and the specific prompts used in evaluations to better contextualize the results.
The paper includes sufficient details regarding the implementation, including the architecture of the models and the training procedures. The use of ModernBERT and a hierarchical decoder is clearly described, and the authors have made their dataset and code available, which supports reproducibility. However, the lack of specific details about the dataset curation process and the exact nature of the prompts used in evaluations could hinder full reproducibility.
One limitation noted is the potential for the LLM-generated inference plan to diverge from training plans, which could lead to discrepancies in output quality. Additionally, while the evaluation metrics are comprehensive, they may not capture all aspects of musical quality, particularly in terms of expressive nuances that could be important for professional compositions. The paper also acknowledges the need for richer annotations to capture finer musical details, which could enhance the model's performance.
The implications of this work are significant for the fields of music generation and artificial intelligence. By providing a framework that can generate high-quality sheet music from textual prompts, Text2Score opens new avenues for composers and musicians, potentially streamlining the creative process. The open-sourcing of the dataset and code encourages further research and development in this area, promoting collaboration and innovation. The integration of LLMs in music generation also highlights the potential for AI to assist in creative fields, which could lead to broader applications in music education and composition.