Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.
Primary: German Research Center for Artificial Intelligence (DFKI)
All Institutions: German Research Center for Artificial Intelligence (DFKI), University of Stuttgart, National Institute of Informatics, Technical University of Berlin
The main contribution of this paper is the introduction of DeepFense, a comprehensive, modular, and extensible framework for robust deepfake audio detection that facilitates reproducible research and addresses critical biases in model performance. This work significantly advances the field by providing a standardized toolkit that enhances the ability to benchmark and compare deepfake detection models effectively.
The methodology presented in DeepFense is robust and well-structured, focusing on creating a modular and extensible framework for deepfake audio detection. The use of a configuration-driven design allows for easy experimentation and reproducibility, which is a significant advancement in the field. The evaluation of more than 400 models and the inclusion of over 100 recipes enhance the toolkit's utility for researchers. The modular architecture facilitates the isolation of algorithmic innovations from implementation artifacts, which is critical for accurate benchmarking.
The experimental evaluation is extensive, covering a large-scale comparison of 400 models across 13 datasets, which is a notable strength of the paper. The results provide valuable insights into the impact of front-end feature extractors, back-end architectures, and training datasets on model performance. The findings regarding biases in model performance based on audio quality, speaker gender, and language are particularly important for ensuring equitable AI systems.
The paper emphasizes reproducibility through its open-source nature and the provision of a comprehensive toolkit that allows other researchers to replicate experiments easily. The use of a single YAML file for experiment configuration is a strong point, as it simplifies the process of sharing and reproducing results.
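The single-YAML recipe design lends itself naturally to a registry-dispatch pattern: string identifiers in the config are resolved against a registry of components. The sketch below illustrates that general idea in plain Python; every field name, component identifier, and the `build_experiment` helper are hypothetical, not DeepFense's actual schema or API.

```python
# Hypothetical recipe, mirroring what a single YAML experiment file
# might contain once parsed (all keys and values are invented).
RECIPE = {
    "front_end": "wav2vec2-xlsr",    # pre-trained feature extractor
    "back_end": "aasist",            # classifier head
    "loss": "weighted_ce",
    "augmentations": ["rawboost", "codec_sim"],
    "train_data": "asvspoof2019_la",
    "eval_data": ["asvspoof2021_df", "in_the_wild"],
}

# Registry mapping identifiers to components (stand-in strings here;
# a real toolkit would register classes or factory functions).
REGISTRY = {
    "front_end": {"wav2vec2-xlsr": "FrontEndA", "lfcc": "FrontEndB"},
    "back_end": {"aasist": "BackEndA", "mlp": "BackEndB"},
}

def build_experiment(recipe):
    """Resolve string identifiers against the component registry, the
    mechanism that lets one config file fully specify an experiment."""
    for key in ("front_end", "back_end"):
        if recipe[key] not in REGISTRY[key]:
            raise KeyError(f"unknown {key}: {recipe[key]!r}")
    return {
        "front_end": REGISTRY["front_end"][recipe["front_end"]],
        "back_end": REGISTRY["back_end"][recipe["back_end"]],
        "eval_sets": list(recipe["eval_data"]),
    }

experiment = build_experiment(RECIPE)
```

The benefit of this pattern is that swapping a front-end or loss becomes a one-line config change, which is what makes large sweeps (hundreds of models) and exact replication feasible.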
While the paper presents a significant advancement, it acknowledges limitations such as the lack of a multi-dataset training pipeline and the focus solely on detection tasks. These limitations suggest areas for future research, including the need for more comprehensive training strategies that can mitigate biases.
The implications of this work are substantial, particularly in the context of increasing concerns about deepfake technology and its potential misuse. By providing a standardized toolkit for deepfake detection, DeepFense can help improve the robustness of systems used in real-world applications, thereby enhancing security and trust in voice biometric systems.
We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds, and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code, and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
Primary: University of Maryland
All Institutions: University of Maryland, NVIDIA
The main contribution of this paper is the introduction of Audio Flamingo Next (AF-Next), a state-of-the-art open audio-language model that significantly advances audio understanding and reasoning capabilities, particularly for long and complex audio inputs. The comprehensive methodology, extensive experimental validation, and commitment to open science position this work as a significant milestone in the development of large audio-language models.
The methodology presented in the paper is robust, featuring a systematic analysis of previous models to identify gaps in audio understanding and reasoning. The introduction of the Temporal Audio Chain-of-Thought paradigm is particularly noteworthy, as it enhances the model's ability to handle long audio inputs by grounding reasoning steps to timestamps. The training strategy, which includes a four-stage curriculum and the curation of a large-scale dataset of over 1 million hours, demonstrates a comprehensive approach to improving model performance across various audio tasks. The use of diverse data sources and the focus on real-world applicability are commendable.
The experiments conducted across 20 audio understanding and reasoning benchmarks are extensive and well-structured. The results show that AF-Next consistently outperforms previous models, including both open-weight and closed models, particularly in long-audio tasks. The paper provides a thorough comparison with state-of-the-art models, showcasing significant improvements in accuracy and robustness. The inclusion of qualitative examples further strengthens the evaluation of the model's capabilities.
The authors have committed to open-sourcing the model weights, training data, and code, which is a significant step towards ensuring reproducibility. However, the paper could benefit from more detailed descriptions of the training configurations and hyperparameters used in each stage, as well as clearer guidelines for replicating the experiments.
The paper acknowledges several limitations, including the challenges posed by noisy and unevenly distributed training data, particularly for low-resource languages and rare sound events. Additionally, while the model improves long-audio understanding, it still faces difficulties with temporally distant evidence. The evaluation focuses primarily on established benchmarks, which may not fully capture the model's capabilities in more complex scenarios.
The advancements presented in AF-Next have the potential to significantly enhance audio understanding applications, including automatic speech recognition, audio captioning, and music information retrieval. The model's ability to handle long-form audio and its open-source nature could foster further research and development in the field, promoting transparency and collaboration among researchers.
Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and producing dynamic orchestration across a full song. Existing open-source approaches typically make compromises among these goals. Continuous-latent generation models can capture long musical spans but often struggle to preserve fine-grained acoustic detail. In contrast, discrete autoregressive models retain local fidelity but suffer from unidirectional generation and error accumulation in extended contexts. We present LaDA-Band, an end-to-end framework that introduces Discrete Masked Diffusion to the V2A task. Our approach formulates V2A generation as Discrete Masked Diffusion, i.e., a global, non-autoregressive denoising formulation that combines the representational advantages of discrete audio codec tokens with full-sequence bidirectional context modeling. This design improves long-range structural consistency and temporal synchronization while preserving crisp acoustic details. Built on this formulation, LaDA-Band further introduces a dual-track prefix-conditioning architecture, an auxiliary replaced-token detection objective for weakly anchored accompaniment regions, and a two-stage progressive curriculum to scale Discrete Masked Diffusion to full-song vocal-to-accompaniment generation. Extensive experiments on both academic and real-world benchmarks show that LaDA-Band consistently improves acoustic authenticity, global coherence, and dynamic orchestration over existing baselines, while maintaining strong performance even without auxiliary reference audio. Codes and audio samples are available at https://github.com/Duoluoluos/TME-LaDA-Band.
Primary: Institute of Computing Technology, Chinese Academy of Sciences (CAS)
All Institutions: Institute of Computing Technology, Chinese Academy of Sciences (CAS), Lyra Lab, Tencent Music Entertainment, Pengcheng Laboratory, State Key Lab of AI Safety
LaDA-Band presents a novel approach to vocal-to-accompaniment generation through Discrete Masked Diffusion, significantly improving upon existing methods in terms of acoustic authenticity, coherence, and orchestration. The comprehensive methodology and rigorous experimental validation position this work as a meaningful contribution to the field of machine learning in audio generation.
The methodology presented in LaDA-Band is innovative, leveraging Discrete Masked Diffusion to address the vocal-to-accompaniment generation problem. The dual-track prefix-conditioning architecture and the auxiliary replaced-token detection objective are significant contributions that enhance the model's ability to generate high-quality accompaniment while maintaining acoustic authenticity and global coherence. The two-stage progressive curriculum for training is a well-thought-out approach that allows the model to scale from short-form to full-song generation effectively.
The experiments conducted are extensive, comparing LaDA-Band against a variety of state-of-the-art baselines across multiple metrics. The results demonstrate consistent improvements in acoustic authenticity, global coherence, and dynamic orchestration, particularly under zero-shot conditions. The use of both objective metrics (like FAD and Onset F1) and subjective evaluations (like MOS) provides a comprehensive assessment of the model's performance.
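Of the objective metrics mentioned, Onset F1 can be made concrete with a short sketch: predicted onset times are matched one-to-one, in order, against reference onsets within a tolerance window (50 ms below, a common convention; the paper's exact setting is not stated in this summary), and F1 is computed from the resulting precision and recall.

```python
def onset_f1(ref, est, tol=0.05):
    """Onset F1 between reference and estimated onset times (seconds).

    A greedy in-order matching pairs each reference onset with at most
    one estimated onset lying within +/- tol seconds of it.
    """
    ref = sorted(ref)
    est = sorted(est)
    matched = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tol:
            matched += 1          # hit: consume both onsets
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1                # spurious early estimate
        else:
            i += 1                # missed reference onset
    if not ref or not est:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Libraries such as mir_eval implement a more careful bipartite matching, but the tolerance-window idea is the same.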
The paper provides detailed implementation specifics, including architecture choices, training procedures, and evaluation metrics, which enhance reproducibility. The availability of the code and audio samples further supports this aspect, allowing other researchers to replicate the study.
While the paper acknowledges limitations such as dependency on the source separation and audio codec pipeline, it also notes challenges in fine-grained control over arrangement details and difficulties with certain stylistically free-form genres. These limitations suggest areas for future research and improvement.
The potential applications of LaDA-Band are significant, particularly in the music production industry, where automated accompaniment generation can streamline workflows for artists and producers. The framework's ability to generate high-quality music without extensive manual intervention could democratize music creation and enhance creative processes.
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within contextual spoken dialogues through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally, the Mouth Agent delivers the script infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology
The paper presents ActorMind, a pioneering framework for speech role-playing that integrates emotional reasoning and contextual understanding through a multi-agent system. This work significantly advances the field of audio-based machine learning by bridging the gap between textual and auditory modalities in role-playing scenarios.
The methodology presented in this paper is innovative, introducing a multi-agent chain-of-thought reasoning framework (ActorMind) that emulates human actor performance in speech role-playing. The four agents (Eye, Ear, Brain, Mouth) are well-defined and contribute to a coherent process for generating emotionally nuanced speech. The hierarchical benchmark (ActorMindBench) is a significant contribution, providing a structured dataset that allows for comprehensive evaluation of speech role-playing capabilities. The design is grounded in established theatrical practices, enhancing its relevance and applicability.
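The four-agent pipeline can be pictured as a simple sequential composition. The sketch below is purely schematic: each agent is reduced to a toy function over strings (the "emotion detection" is a trivial keyword check), whereas the actual framework presumably prompts language and audio models at every stage.

```python
# Schematic Eye -> Ear -> Brain -> Mouth composition; all function names
# and logic are illustrative stand-ins, not the authors' implementation.

def eye_agent(role_description: str) -> dict:
    # Read and condense the assigned role description.
    return {"persona": role_description.strip()}

def ear_agent(dialogue_history: list) -> dict:
    # Extract emotional cues from the contextual spoken dialogue
    # (here: trivially, turns ending in an exclamation mark).
    return {"cues": [turn for turn in dialogue_history if "!" in turn]}

def brain_agent(persona: dict, hearing: dict) -> str:
    # Produce a descriptive emotional state from persona + cues.
    return "excited" if hearing["cues"] else "neutral"

def mouth_agent(script: str, emotion: str) -> str:
    # Deliver the script infused with the inferred emotional state
    # (represented here as a simple tag a TTS stage could consume).
    return f"[{emotion}] {script}"

def actor_mind(role: str, dialogue: list, script: str) -> str:
    persona = eye_agent(role)
    hearing = ear_agent(dialogue)
    emotion = brain_agent(persona, hearing)
    return mouth_agent(script, emotion)
```

The value of the decomposition is that each stage has a single, inspectable output (persona, cues, emotional state), which is what makes the chain-of-thought style reasoning auditable.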
The experimental evaluation is robust, utilizing a well-structured dataset derived from a popular TV series, which ensures familiarity and relatability in the speech role-playing context. The use of subjective evaluation metrics (RP-MOS) adds credibility to the results, allowing for nuanced assessment of emotional expression and delivery accuracy. The paper reports clear performance improvements over baseline models, demonstrating the effectiveness of ActorMind in generating spontaneous and contextually appropriate speech.
The paper provides sufficient implementation details, including the construction pipeline for ActorMindBench and the operational mechanics of each agent in ActorMind. However, the lack of a publicly available demo or audio samples limits the immediate reproducibility of the results. Future work could benefit from sharing more implementation specifics or a demo to facilitate broader validation.
The primary limitation noted is the reliance on a single source (Friends Season 1) for the benchmark, which restricts the diversity of roles and contexts. This could limit the generalizability of the findings. Additionally, while the framework is off-the-shelf, further training could enhance its performance, particularly in more complex role-playing scenarios.
The work has significant implications for human-machine interaction, particularly in applications requiring emotionally intelligent responses, such as virtual assistants, gaming, and therapeutic settings. By advancing speech role-playing capabilities, it opens avenues for more engaging and realistic interactions between humans and machines, potentially transforming user experiences in various domains.
Multichannel speech enhancement is widely used as a front-end in microphone array processing systems. While most existing approaches produce a single enhanced signal, direction-preserving multiple-input multiple-output (MIMO) methods instead aim to provide enhanced multichannel signals that retain directional properties, enabling downstream applications such as beamforming, binaural rendering, and direction-of-arrival estimation. In this work, we propose a fully blind, direction-preserving MIMO speech enhancement method based on neural estimation of the spatial noise covariance matrix. A lightweight OnlineSpatialNet estimates a scale-normalized Cholesky factor of the frequency-domain noise covariance, which is combined with a direction-preserving MIMO Wiener filter to enhance speech while preserving the spatial characteristics of both target and residual noise. In contrast to prior approaches relying on oracle information or mask-based covariance estimation for single-output systems, the proposed method directly targets accurate multichannel covariance estimation with low computational complexity. Experimental results show improved speech enhancement, covariance estimation capability, and performance in downstream tasks over a mask-based baseline, approaching oracle performance with significantly fewer parameters and computational cost.
Primary: Chalmers University of Technology
All Institutions: Chalmers University of Technology
This paper presents a direction-preserving MIMO speech enhancement method utilizing a neural covariance estimator, which significantly advances the field by improving both computational efficiency and performance in multichannel audio applications. The innovative approach and thorough experimental validation position it as a valuable contribution to audio signal processing research.
The proposed methodology introduces a novel approach to MIMO speech enhancement by utilizing a neural network for covariance estimation, specifically through the OnlineSpatialNet architecture. This method effectively reduces the reliance on oracle information and mask-based techniques, which have been limitations in previous models. The integration of a direction-preserving MIMO Wiener filter enhances the robustness of the approach while maintaining spatial characteristics of the audio signals. The choice of a lightweight network architecture is commendable, as it balances performance with computational efficiency.
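For context, the classical direction-preserving MIMO Wiener filter that this method builds on can be written per frequency bin as W = Phi_s (Phi_s + Phi_n)^{-1}, where Phi_s and Phi_n are the M x M speech and noise spatial covariance matrices; applying W keeps the output M-channel, so inter-channel (directional) cues survive. A minimal numpy sketch with toy covariances follows (the paper's contribution, estimating Phi_n with OnlineSpatialNet via a scale-normalized Cholesky factor, is not reproduced here).

```python
import numpy as np

def mimo_wiener(phi_s: np.ndarray, phi_n: np.ndarray) -> np.ndarray:
    """M x M MIMO Wiener filter for one frequency bin:
    W = Phi_s @ inv(Phi_s + Phi_n)."""
    return phi_s @ np.linalg.inv(phi_s + phi_n)

# Toy example: rank-1 speech covariance from a unit-norm steering
# vector d (hypothetical direction), plus spatially white noise.
M = 4
d = np.ones((M, 1)) / np.sqrt(M)
phi_s = 10.0 * (d @ d.conj().T)   # speech spatial covariance (high SNR)
phi_n = np.eye(M)                 # white-noise spatial covariance

W = mimo_wiener(phi_s, phi_n)
y = 2.0 * d.flatten()             # noiseless M-channel observation from d
s_hat = W @ y                     # enhanced estimate stays M-channel
```

Because phi_s has d as an eigenvector with eigenvalue 10, the filter scales signals from that direction by 10/11 while leaving their spatial signature (the direction of d) intact, which is exactly the direction-preserving property.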
The experiments are well-structured, utilizing a comprehensive dataset generated from the DNS challenge and simulating realistic acoustic environments. The comparison against the NICE model provides a solid benchmark, and the reported metrics (SI-SDR, Cholesky loss, and covariance similarity) effectively demonstrate the advantages of the proposed method. The results indicate significant improvements in speech enhancement and covariance estimation, validating the effectiveness of the OnlineSpatialNet architecture.
The paper provides sufficient details regarding the experimental setup, including dataset generation, model configurations, and training procedures. However, the lack of a public code repository may hinder full reproducibility. The authors should consider releasing their code to facilitate further research and validation of their findings.
One identified limitation is the reliance on simulated data, which may not fully capture the complexities of real-world environments. Additionally, while the OnlineSpatialNet shows promising results, it may still struggle in highly reverberant or non-stationary noise conditions, which are common in practical applications. The paper could benefit from discussing these limitations more explicitly.
The proposed method has significant implications for various applications in audio processing, including hearing aids, telecommunication systems, and immersive audio experiences. By preserving directional information while enhancing speech quality, this research can contribute to advancements in spatial audio technologies and improve user experiences in noisy environments.
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs' EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.
Primary: Nanjing University
All Institutions: Nanjing University, Northwestern Polytechnical University, AISHELL
This paper introduces HumDial-EIBench, a novel benchmark for evaluating the emotional intelligence of audio language models using real human dialogues. The comprehensive methodology and significant findings regarding model performance gaps contribute meaningfully to the advancement of multimodal AI systems, highlighting the need for improved emotional understanding in AI interactions.
The proposed methodology is robust, leveraging real human dialogues to create a comprehensive benchmark for evaluating emotional intelligence in audio language models. The reformulation of tasks into multiple-choice questions with adversarial distractors is innovative and addresses the subjective biases present in previous benchmarks. The introduction of an acoustic-semantic conflict task is particularly noteworthy, as it evaluates models' abilities to handle contradictory multimodal signals, which is a significant gap in existing frameworks. The structured data construction pipeline ensures high-quality recordings and a controlled evaluation environment, enhancing the reliability of the results.
The experiments conducted on eight state-of-the-art audio language models provide valuable insights into their performance across various emotional intelligence tasks. The results highlight critical deficiencies in current models, particularly in multi-turn emotional tracking and implicit causal reasoning. The use of both automated and human evaluations for different tasks adds depth to the analysis, although the reliance on LLMs for some scoring introduces variability. The findings are well-supported by quantitative metrics and qualitative assessments, making a strong case for the proposed benchmark's effectiveness.
The paper provides a clear description of the methodology and evaluation metrics, along with a link to the GitHub repository for accessing the benchmark. However, details on the specific implementations of the evaluated models and their configurations are limited, which may hinder full reproducibility. The authors could enhance this aspect by providing more granular information on the experimental setup and model parameters.
The study acknowledges limitations, such as the high variance in text empathy evaluation scores, indicating challenges in objectively quantifying empathy depth. Additionally, the acoustic-semantic conflict evaluation is currently limited to single-turn utterances, which may not fully capture the complexities of real-world interactions. Future work is needed to expand multi-turn conflict scenarios and improve automatic evaluation metrics.
The development of HumDial-EIBench has significant implications for the field of emotional intelligence in AI, particularly in enhancing the capabilities of audio language models. By addressing critical gaps in existing benchmarks, this work paves the way for more nuanced evaluations of multimodal systems, potentially leading to advancements in applications such as conversational agents, mental health support systems, and interactive entertainment.
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference's voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
Primary: Tsinghua University
All Institutions: Tsinghua University, The Chinese University of Hong Kong, Shenzhen
MimicLM presents a novel approach to voice imitation that leverages synthetic speech as training sources while retaining real recordings as targets, significantly enhancing the quality and naturalness of generated speech. The comprehensive evaluation and innovative methodology position this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology in MimicLM is innovative, particularly in its role-swapping data construction strategy which utilizes synthetic speech as sources while preserving real recordings as targets. This approach effectively addresses the scarcity of parallel data in voice imitation tasks and breaks the quality ceiling associated with synthetic targets. The incorporation of interleaved text-audio modeling enhances content fidelity, while preference alignment during post-training mitigates the distributional gap between training and inference. These methodological advancements are well-grounded in the challenges of voice imitation, making the approach both practical and theoretically sound.
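The summary does not name the exact preference-alignment objective used in post-training; one widely used instantiation is the DPO loss, sketched below on scalar log-probabilities for a single (preferred, dispreferred) pair. This is a generic illustration of preference alignment, not MimicLM's confirmed recipe.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred
    output; ref_logp_*: the frozen reference model's log-probs.
    Loss = -log sigmoid(beta * (reward margin of preferred over
    dispreferred, measured relative to the reference model)).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; pushing probability mass toward preferred outputs drives the loss down, which is how such objectives narrow a train/inference distribution mismatch.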
The experimental evaluation is comprehensive, utilizing both subjective and objective metrics to assess the performance of MimicLM against state-of-the-art systems. The use of a large-scale dataset (Emilia) for training and the systematic evaluation across multiple benchmarks (SeedTTS test-vc-en and MimicLM-Test) demonstrates the robustness of the results. The paper presents clear comparisons with existing methods, showing significant improvements in naturalness, intelligibility, and similarity metrics, which are crucial for voice imitation tasks.
The paper provides detailed implementation details, including training configurations, data construction processes, and evaluation metrics. However, the absence of a publicly available code repository limits reproducibility. While the methodology is described in depth, access to the actual implementation would enhance the ability of other researchers to replicate the results.
The paper acknowledges several limitations, including the dependency on the quality of the TTS model used for generating synthetic speech and the potential for higher word error rates (WER) on real inputs. Additionally, the reliance on external systems for TTS may introduce variability that affects the overall performance. The authors also highlight the risks associated with misuse of voice imitation technology, which necessitates careful consideration in deployment.
The advancements in voice imitation technology presented in this work have significant implications for applications in personalized voice assistants, audiobook narration, and accessibility tools. However, the potential for misuse, such as unauthorized voice cloning and impersonation, raises ethical concerns that must be addressed through appropriate safeguards and regulations. The authors emphasize the importance of responsible deployment and the need for ongoing dialogue within the research community regarding the ethical implications of their work.
Recent advances in Speech Large Language Models (Speech-LLMs) have made significant progress, greatly enhancing multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan data, coupled with the phonetic differences among its major dialects (Ü-Tsang, Amdo, and Kham), is a prime example of this challenge. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To efficiently align speech and text, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy. Experimental results demonstrate that Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation. Our work validates the effectiveness of cross-dialectal cooperation and provides a scalable paradigm for the development of Speech-LLMs in low-resource scenarios.
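The temperature-based sampling strategy is described only at a high level; a common formulation, sketched here as an assumption about what such a strategy implements (the corpus sizes are hypothetical), raises each dialect's empirical share to the power 1/T so that low-resource dialects are up-sampled:

```python
def temperature_sampling_probs(counts, temperature=5.0):
    """Rescale per-dialect sampling probabilities.

    With temperature > 1, low-resource dialects are up-sampled
    relative to their raw share of the data; temperature = 1
    recovers proportional sampling.
    """
    total = sum(counts.values())
    weights = {d: (n / total) ** (1.0 / temperature) for d, n in counts.items()}
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}

# Hypothetical corpus sizes for the three dialects
probs = temperature_sampling_probs({"U-Tsang": 800, "Amdo": 150, "Kham": 50})
```

With T = 5, the smallest dialect (Kham) receives far more than its 5% raw share of the sampling budget, which is the synergy-maximizing effect the abstract describes.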
Primary: Minzu University of China
All Institutions: Minzu University of China, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan, which effectively addresses the challenges of data scarcity and dialectal diversity through innovative methodologies and comprehensive experimental validation. This work significantly advances the state-of-the-art in speech processing for low-resource languages, providing a scalable framework for future research and applications.
The paper introduces Ti-Audio, a novel end-to-end Speech-LLM specifically designed for Tibetan, which is a low-resource language. The methodology is innovative, employing a Dynamic Q-Former Adapter to bridge the gap between speech and text modalities effectively. The approach leverages cross-dialectal cooperation to enhance performance in resource-scarce settings, which is a significant advancement in the field of speech processing for low-resource languages. The use of a temperature-aware data balancing strategy is particularly noteworthy, as it addresses data imbalance issues effectively. Overall, the methodology is well-structured and presents a clear advancement over existing techniques.
The experiments are comprehensive, demonstrating the effectiveness of Ti-Audio across various tasks, including automatic speech recognition (ASR) and speech translation (ST). The results show significant improvements over baseline models, with state-of-the-art performance metrics reported. The experimental setup is robust, utilizing a well-constructed dataset that addresses the challenges of dialectal diversity and data scarcity. The paper also includes ablation studies that validate the contributions of different components of the architecture, enhancing the credibility of the results.
The paper provides detailed descriptions of the model architecture, training procedures, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or dataset limits others' ability to replicate the results fully. The authors should consider releasing their code and data to facilitate further research in this area.
One limitation is the reliance on proprietary datasets, which may not be accessible to the broader research community. Additionally, while the model shows strong performance, the evaluation of emotional recognition tasks indicates that there are still challenges in modeling subtle emotional cues, suggesting areas for future improvement. The paper could also benefit from a more thorough exploration of the limitations of the proposed approach in terms of scalability and generalization to other low-resource languages.
The development of Ti-Audio has significant implications for the field of speech processing, particularly for low-resource languages. By demonstrating that cross-dialectal cooperation can enhance model performance, this work opens avenues for similar approaches in other dialectically diverse languages. The findings could lead to improved accessibility and usability of speech technologies for Tibetan speakers and potentially other low-resource language communities.
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Peking University, WeChat Vision, Tencent Inc
The main contribution of this paper is the introduction of Audio-Omni, a unified framework for audio understanding, generation, and editing, which leverages a novel architecture and a large-scale dataset to achieve state-of-the-art performance across multiple audio tasks. This work significantly advances the field of multimodal audio processing by providing a comprehensive solution that integrates various audio capabilities into a single model, setting a new standard for future research in generative audio intelligence.
The paper introduces Audio-Omni, a novel framework that integrates audio understanding, generation, and editing across diverse audio domains. Its architecture employs a frozen Multimodal Large Language Model (MLLM) for high-level reasoning and a trainable Diffusion Transformer (DiT) for synthesis, which is a significant advancement in unifying these tasks. The hybrid conditioning mechanism effectively separates high-level semantic inputs from low-level signal features, allowing for precise audio manipulation. The dataset construction method is also innovative, combining real-world data mining with synthetic data generation to create a large-scale dataset for instruction-guided audio editing.
The experiments are extensive and demonstrate that Audio-Omni outperforms prior unified models and matches or exceeds the performance of specialized models across various benchmarks. The use of both objective metrics (like FAD and LSD) and subjective evaluations (human ratings) provides a comprehensive assessment of the model's capabilities. The results validate the effectiveness of the proposed architecture and its ability to generalize across multiple audio tasks.
The paper provides detailed implementation details, including architecture specifications, training protocols, and evaluation metrics, which enhances reproducibility. The authors commit to releasing their code, model, and dataset publicly, which is a positive step toward enabling other researchers to replicate their findings.
While the framework shows promise, it may still face challenges in handling highly complex audio editing tasks that require nuanced understanding beyond the current capabilities of the MLLM. Additionally, the reliance on a large-scale dataset may limit accessibility for researchers without similar resources.
The potential applications of this work are significant, ranging from creative audio generation to practical applications in media production and accessibility technologies. However, ethical considerations regarding the misuse of generative audio technologies, such as deepfakes, must be addressed.
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
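LilyBERT extends CodeBERT's vocabulary so that LilyPond commands are kept as atomic units rather than split into subwords; the released tokenizer is on the Hugging Face hub, but the core idea can be sketched with a toy regex tokenizer (the handling below is illustrative, not the actual implementation):

```python
import re

def tokenize_lilypond(source: str) -> list[str]:
    """Toy tokenizer: backslash commands stay whole; other runs split on whitespace."""
    return re.findall(r"\\[A-Za-z]+|[^\s\\]+", source)

tokenize_lilypond(r"\clef treble \time 3/4 c'4 d'8")
```

Keeping `\clef` or `\time` as single tokens preserves their musical semantics, whereas a generic subword tokenizer would fragment them into meaningless pieces.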
Primary: Centro di Sonologia Computazionale (CSC)
All Institutions: Centro di Sonologia Computazionale (CSC), Department of Information Engineering, University of Padua, Boston University
The main contribution of this paper is the introduction of BMdataset and LilyBERT, which together provide a robust framework for symbolic music representation learning, demonstrating that expert-curated datasets can outperform larger, noisier datasets in music classification tasks. This work significantly advances the field by addressing a gap in the use of text-based music formats and establishing a new baseline for future research.
The paper presents a novel approach to music representation learning by introducing BMdataset, a carefully curated dataset of LilyPond scores, and LilyBERT, a CodeBERT-based model specifically adapted for symbolic music. The methodology includes a unique tokenizer that preserves musical semantics by treating LilyPond-specific commands as atomic units. The two-stage training process, combining broad pre-training on a large corpus with domain-specific fine-tuning, is well-justified and effectively demonstrated through rigorous experiments.
The experiments conducted are comprehensive, utilizing linear probing to assess the effectiveness of the proposed model. The results indicate that the curated dataset significantly outperforms larger, less curated datasets for composer and style classification tasks. The systematic evaluation on the Mutopia corpus provides a solid benchmark for future research, and the findings are statistically significant, showcasing the advantages of expert curation in training datasets.
The authors have made their dataset, model weights, and code publicly available, which enhances reproducibility. The detailed descriptions of the dataset creation process, model architecture, and training procedures allow other researchers to replicate the study. However, the paper could benefit from clearer documentation of the training environment and hyperparameter settings.
The dataset is skewed towards certain composers, particularly Vivaldi, which may limit its generalizability. Additionally, the reliance on automatically converted data from the PDMX corpus for pre-training may introduce artifacts that could affect the model's performance. The authors acknowledge these limitations and suggest future work to expand the dataset and explore more robust model architectures.
This work has significant implications for the field of music information retrieval and generative AI, particularly in enhancing the understanding and generation of symbolic music. The introduction of a domain-specific model like LilyBERT could pave the way for more nuanced applications in music analysis, composition, and education, fostering greater engagement with less-represented composers in the Baroque period.
Deep learning models have improved sign language-to-text translation and made it easier for non-signers to understand signed messages. When the goal is spoken communication, a naive approach is to convert signed messages into text and then synthesize speech via Text-to-Speech (TTS). However, this two-stage pipeline inevitably treats text as a bottleneck representation, causing the loss of rich non-verbal information originally conveyed in the signing. To address this limitation, we propose a novel task, \emph{Sign-to-Speech Prosody Transfer}, which aims to capture the global prosodic nuances expressed in sign language and directly integrate them into synthesized speech. A major challenge is that aligning sign and speech requires expert knowledge, making annotation extremely costly and preventing the construction of large parallel corpora. To overcome this, we introduce \emph{SignRecGAN}, a scalable training framework that leverages unimodal datasets without cross-modal annotations through adversarial learning and reconstruction losses. Furthermore, we propose \emph{S2PFormer}, a new model architecture that preserves the expressive power of existing TTS models while enabling the injection of sign-derived prosody into the synthesized speech. Extensive experiments demonstrate that the proposed method can synthesize speech that faithfully reflects the emotional content of sign language, thereby opening new possibilities for more natural sign language communication. Our code will be available upon acceptance.
Primary: Keio University
All Institutions: Keio University
The paper presents a significant advancement in sign language processing through the introduction of a novel task and a robust methodology that effectively captures prosodic nuances in synthesized speech. The combination of adversarial learning and reconstruction losses represents a meaningful contribution to the field, with potential applications that could greatly enhance communication for the hearing impaired.
The paper introduces a novel task of Sign-to-Speech Prosody Transfer, which is a significant advancement in the field of multimodal learning. The methodology employs a GAN-based framework (SignRecGAN) that utilizes unpaired unimodal datasets, thus addressing the challenge of obtaining aligned datasets for sign and speech. The architecture (S2PFormer) effectively integrates sign-derived prosody into synthesized speech, maintaining the expressiveness of TTS models. The use of adversarial learning combined with reconstruction losses (SignRec loss and ProMo loss) is innovative, ensuring that the synthesized speech retains the nuances of sign language. However, the paper could benefit from a more detailed exploration of the limitations of the proposed losses and their impact on the final output.
The experimental setup is robust, utilizing both qualitative and quantitative evaluations, including user studies and objective metrics like WER and UTMOS. The paper reports significant findings that demonstrate the effectiveness of the proposed method in capturing emotional nuances in synthesized speech compared to traditional two-stage methods. The ablation studies provide insights into the contributions of each component, reinforcing the importance of the proposed losses. However, the paper lacks a comprehensive comparison with other state-of-the-art methods in the same domain, which could strengthen its claims.
The paper provides sufficient details on the datasets used, preprocessing steps, and the architecture of the model. However, the absence of a publicly available code repository at the time of review limits reproducibility. The authors mention that the code will be available upon acceptance, which is a positive aspect but should ideally be accessible during the review process.
One limitation is the reliance on unimodal datasets, which may not fully capture the complexities of sign language prosody. Additionally, the subjective evaluation metrics, while valuable, may introduce bias depending on the participants' familiarity with sign language. The paper also does not address the potential challenges in scaling the model to different sign languages or dialects.
The proposed method has significant implications for improving communication for individuals with hearing impairments, potentially enhancing the expressiveness and naturalness of synthesized speech in sign language applications. This could lead to better integration of sign language users in various contexts, including education and social interactions. The approach also opens avenues for further research in multimodal learning and prosody transfer, which could benefit other areas of machine learning.
Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and European acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while holding architecture and training protocols minimal and constant. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (12.5% gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.
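The psychoacoustic scales being compared have standard closed forms; the definitions below are the textbook mel (O'Shaughnessy/HTK) and ERB-rate (Glasberg and Moore) mappings, shown for reference rather than taken from the paper's code:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_erb_rate(f_hz: float) -> float:
    """ERB-rate scale: 21.4 * log10(1 + 0.00437 * f)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

# Both scales compress high frequencies, but at different rates,
# which changes how filterbank channels are allocated across the
# pitch ranges that matter for tonal languages.
```

The cultural-bias argument hinges on exactly this allocation: a filterbank spaced on one of these curves concentrates resolution where the scale's original listening tests said it mattered.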
Primary: Presight AI
All Institutions: Presight AI
The main contribution of this paper is the identification and quantification of cross-cultural bias in mel-scale audio representations, alongside the introduction of alternative representations that significantly reduce performance disparities. This work is a critical step towards developing fairer audio systems, highlighting the importance of cultural considerations in machine learning applications.
The paper employs a robust methodology that systematically evaluates the impact of mel-scale representations on audio processing across diverse cultural contexts. The authors isolate the contributions of various front-end configurations while maintaining consistent architecture and training protocols. They introduce a comprehensive set of fairness metrics to quantify performance disparities, which is a significant advancement in the evaluation of audio systems. The theoretical foundation is well-articulated, linking frequency resolution to classification error, thereby providing a strong basis for their claims.
The experiments are well-designed, utilizing a diverse set of datasets across speech recognition, music analysis, and acoustic scene classification. The balanced sampling across languages and musical traditions ensures that the results are meaningful and generalizable. The statistical significance of the findings is rigorously tested, enhancing the credibility of the results. The performance gaps highlighted in the results section are compelling and underscore the need for alternative representations.
The authors provide sufficient details regarding their experimental setup, including hyperparameters and dataset specifications, which facilitates reproducibility. The release of FairAudioBench as a benchmark for cross-cultural evaluation further enhances the reproducibility of their findings and allows other researchers to validate and build upon their work.
While the study is comprehensive, it acknowledges limitations in geographic coverage, particularly the underrepresentation of African tonal languages and indigenous musical traditions. Additionally, the focus on single-axis biases without addressing intersectionality may overlook complex interactions between different forms of bias. Future work could expand on these aspects to provide a more nuanced understanding of audio processing disparities.
This research has significant implications for the development of inclusive audio systems that are equitable across cultural contexts. By challenging the assumptions underlying traditional psychoacoustic models, the authors advocate for a paradigm shift in audio processing that considers cultural diversity. The findings can inform the design of more effective speech recognition systems and music analysis tools that serve a global audience, ultimately contributing to a more equitable technological landscape.
Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. To address this gap, we propose VidAudio-Bench, a multi-task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories - sound effects, music, speech, and singing - under both V2A and Video-Text-to-Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video-text pairs and benchmarks 11 state-of-the-art generation models. (3) Comprehensive Metrics: It introduces 13 task-specific, reference-free metrics to systematically assess audio quality, video-audio consistency, and text-audio consistency. (4) Human Alignment: It validates all metrics through subjective studies, demonstrating strong consistency with human preferences. Experimental results reveal that current V2A models perform poorly in speech and singing compared to sound effects. Our VT2A results further highlight a fundamental tension between instruction following and visually grounded generation: stronger visual conditioning improves video-audio alignment, but often at the cost of generating the intended audio category. These findings establish VidAudio-Bench as a comprehensive and scalable framework for diagnosing V2A systems and provide new insights into multimodal audio generation.
Primary: Shanghai Jiaotong University
All Institutions: Shanghai Jiaotong University
The main contribution of this paper is the establishment of VidAudio-Bench, a comprehensive benchmark for evaluating V2A and VT2A systems, which systematically addresses the limitations of existing evaluation methodologies and provides valuable insights into the performance of current models. The technical contributions, including the innovative evaluation metrics and the extensive dataset, position this work as a significant advancement in the field of audio generation.
The paper introduces VidAudio-Bench, a novel benchmarking framework for Video-to-Audio (V2A) and Video-Text-to-Audio (VT2A) generation that addresses the limitations of existing evaluation methods by providing a multi-task benchmark with task-specific metrics. The methodology is robust, featuring a comprehensive dataset of 1,634 video-text pairs across four audio categories, and it employs both objective and subjective evaluation metrics, including human alignment studies to validate the proposed metrics. The introduction of a zero-information-leak design for VT2A evaluation is particularly innovative, allowing for a clearer assessment of visual understanding without relying on textual shortcuts.
The experimental evaluation is thorough, benchmarking 11 state-of-the-art models across various tasks and dimensions. The results reveal significant insights into the performance of current V2A models, particularly their struggles with speech and singing tasks. The paper effectively uses a variety of metrics to assess audio quality, video-audio consistency, and text-audio consistency, providing a comprehensive view of model performance. The correlation analysis with human evaluations further strengthens the credibility of the findings.
The paper provides detailed descriptions of the dataset construction, evaluation metrics, and experimental setup, which are essential for reproducibility. However, the absence of publicly available code or datasets prevents other researchers from replicating the results directly. The methodology is well-documented, but the lack of a project URL or demo limits broader accessibility.
One limitation is the reliance on subjective human evaluations, which, while valuable, can introduce variability and bias. Additionally, the dataset may not cover all possible scenarios in V2A generation, potentially limiting the generalizability of the findings. The paper also notes a fundamental tension between instruction following and visually grounded generation, indicating that there are inherent challenges in achieving optimal performance across all tasks.
The development of VidAudio-Bench has significant implications for the field of multimodal audio generation, providing a structured framework that can guide future research and model development. By highlighting the challenges faced by current V2A models, the paper encourages further exploration into improving audio generation systems, which can enhance applications in entertainment, accessibility, and human-computer interaction.
MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) networks: a private listener-level CfC that predicts a short-horizon affective trajectory on Russell's circumplex and drives proactive curation, and a shared mesh-runtime CfC at MMP Layer 6 that integrates Cognitive Memory Blocks (CMBs) from co-listening peers. CfC hidden states never cross the wire; only structured CMBs do. A Personal Arousal Function (PAF) replaces the standard linear mapping from audio intensity to psychological arousal with a per-listener learned adjustment, trained from behavioral signals (skip, completion, favorite, volume) and from drift between user-declared mood and machine inference. The same track receives different arousal predictions for different listeners. The model (94,552 parameters) achieves trajectory MAE 0.414, pattern accuracy 96.6%, and intent accuracy 69.4% on held-out validation. PAF evidence from a live deployment session (46 observations across 11 genres) demonstrates that the learning loop operates end-to-end, with pop reaching full confidence after 22 observations. All inference runs on-device via CoreML. To our knowledge, this is the first production deployment of MMP/SVAF on consumer mobile hardware. The accompanying SDK (sym-swift v0.3.78, SYMCore v0.3.7) enforces strict protocol conformance. Music is the case study; the substrate is the contribution.
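The Personal Arousal Function is described as a per-listener learned adjustment to a linear intensity-to-arousal mapping, trained from behavioral signals. A hypothetical minimal sketch of that idea, assuming a per-listener gain and bias trained by gradient steps on observed feedback (the update rule and signal encoding are illustrative, not the deployed CoreML model):

```python
class PersonalArousalFunction:
    """Hypothetical sketch: a per-listener affine adjustment to the
    standard linear mapping from audio intensity to arousal.

    The update rule below is a plain squared-error gradient step on a
    scalar feedback signal; the production PAF also folds in skip,
    completion, favorite, and volume events, which are not modeled here.
    """

    def __init__(self, lr: float = 0.05):
        self.gain, self.bias, self.lr = 1.0, 0.0, lr

    def predict(self, intensity: float) -> float:
        return self.gain * intensity + self.bias

    def update(self, intensity: float, observed_arousal: float) -> None:
        # One gradient step toward the listener's observed response,
        # so the same track drifts toward different arousal predictions
        # for different listeners over time.
        err = self.predict(intensity) - observed_arousal
        self.gain -= self.lr * err * intensity
        self.bias -= self.lr * err
```

Repeated updates pull the prediction for a given intensity toward that listener's observed responses, which is the per-listener divergence the abstract describes.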
Primary: SYM.BOT
All Institutions: SYM.BOT
The main contribution of MeloTune is its innovative architecture that combines continuous-time modeling with peer-to-peer mood coupling for personalized music curation. This approach addresses key limitations in traditional music recommendation systems, offering a promising direction for future research and applications in affect-aware technologies.
The methodology presented in MeloTune is innovative, leveraging a dual-layer architecture that combines a private listener-level Closed-form Continuous-time (CfC) network with a shared mesh-runtime CfC for peer-to-peer mood coupling. The Personal Arousal Function (PAF) is a significant advancement, allowing for personalized arousal predictions based on behavioral signals, which is a notable departure from traditional methods that rely on audio intensity alone. The use of Cognitive Memory Blocks (CMBs) for structured communication between agents is a unique aspect that enhances the system's ability to maintain privacy while still enabling collaborative mood curation. The continuous-time modeling approach is well-justified and effectively addresses the limitations of existing sequential recommendation systems.
The paper provides quantitative results from a live deployment, including metrics such as trajectory Mean Absolute Error (MAE), pattern accuracy, and intent accuracy. While the results are promising, the absence of a comprehensive controlled evaluation and comparisons against established benchmarks limits the robustness of the findings. The reported metrics indicate that the system performs well in predicting affective trajectories, but further validation against more diverse datasets and user scenarios would strengthen the claims.
The implementation details are described in sufficient depth, particularly regarding the architecture and training procedures. However, the lack of publicly available code or a demo limits reproducibility. The paper mentions an SDK, but without access to the actual implementation, independent verification of the results is challenging.
The primary limitations include the reliance on user-declared moods, which may not always be available or accurate, potentially affecting the PAF's learning process. Additionally, the system's performance in diverse real-world scenarios and with different user demographics is not fully explored. The absence of a controlled evaluation against standard recommendation systems raises questions about the generalizability of the results.
MeloTune has the potential to significantly impact the music recommendation landscape by providing a more personalized and context-aware listening experience. The approach could be extended to other domains where user affect and social context play a crucial role, such as in mental health applications or collaborative environments. The focus on privacy-preserving techniques is particularly relevant in today's data-sensitive climate.
Audio-native large language models (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensive training on large-scale non-speech data. We present Whisper-AuT, a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on a curated mixture of speech (80%), environmental sound (10%), and music (10%) totaling approximately 20M samples. The full encoder-decoder is trained end-to-end with a seq2seq captioning objective; the decoder is then discarded and only the encoder is retained. Linear probe evaluations show that Whisper-AuT achieves +23.0% on ESC-50 (environmental sound), +5.0% on GTZAN (music genre), and +0.7% on Speech Commands (keyword spotting) compared to the original Whisper-large-v3 encoder. Whisper-AuT is designed as a drop-in replacement for Whisper in audio-LLM architectures, with the goal of reducing downstream training cost by providing stronger initial audio representations for non-speech domains.
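The linear probe protocol used in the evaluation above can be sketched as follows: frozen encoder features (stood in for here by synthetic vectors) are fed to a least-squares linear classifier whose accuracy measures representation quality. This is an illustrative protocol, not the authors' exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen encoder features: two classes of 32-dim
# embeddings (in practice, pooled Whisper-AuT encoder outputs).
n, d = 200, 32
centers = rng.normal(size=(2, d))
labels = rng.integers(0, 2, size=n)
feats = centers[labels] + 0.3 * rng.normal(size=(n, d))

# Linear probe: frozen features, one-hot targets, closed-form
# least-squares weights (the encoder itself is never updated).
onehot = np.eye(2)[labels]
X = np.hstack([feats, np.ones((n, 1))])        # append bias column
W, *_ = np.linalg.lstsq(X, onehot, rcond=None)
preds = (X @ W).argmax(axis=1)
probe_acc = (preds == labels).mean()
```

Because only the linear layer is fit, differences in probe accuracy between encoders (e.g. Whisper-large-v3 vs. Whisper-AuT on ESC-50) isolate the quality of the frozen representations.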
Primary: Salesforce AI Research
All Institutions: Salesforce AI Research
The main contribution of this paper is the introduction of Whisper-AuT, a domain-adapted audio encoder that improves the representation of non-speech audio, thereby reducing the training costs and enhancing the performance of downstream audio-LLMs. This work represents a meaningful advancement in the field of audio processing and machine learning, particularly in the context of integrating audio understanding with large language models.
The methodology is clearly articulated, following a systematic approach to fine-tune the Whisper-large-v3 model on a curated dataset that includes a balanced mix of speech, environmental sounds, and music. The use of a seq2seq training paradigm is consistent with existing practices, but the adaptation to a mixed-domain dataset is a notable improvement. The decision to retain only the encoder after training is a practical choice that simplifies integration into existing audio-LLM frameworks. However, the paper could benefit from more detailed descriptions of the training process and hyperparameter choices.
The experimental evaluation is robust, utilizing linear probing on well-established benchmarks (ESC-50, GTZAN, Speech Commands) to assess the encoder's performance across different audio domains. The reported improvements (+23.0% on ESC-50, +5.0% on GTZAN, +0.7% on Speech Commands) are significant and demonstrate the effectiveness of the proposed approach. However, the evaluation could be strengthened by including additional metrics or qualitative assessments to provide a more comprehensive view of the encoder's capabilities.
The paper provides a reasonable level of detail regarding the training configuration and data preparation, which aids in reproducibility. However, the lack of specific hyperparameter settings and the absence of a publicly available code repository hinder full reproducibility. Including these details would greatly enhance the paper's impact.
One limitation is the reliance on a relatively small dataset (20M samples) for fine-tuning, which may not fully capture the diversity of non-speech audio. Additionally, while the improvements on environmental sound and music are notable, the marginal gain on speech suggests that the encoder may not significantly enhance performance in that domain. Future work should explore the effects of varying the dataset composition and size.
The development of Whisper-AuT has the potential to significantly reduce the computational burden associated with training audio-LLMs, making them more accessible for various applications in audio understanding and generation. By providing stronger initial representations for non-speech audio, this work could enhance the performance of audio-LLMs in real-world applications, such as content creation, sound classification, and interactive audio systems.
The psychological profile that structurally documents the case of a depression patient is essential for psychotherapy. Large language models can be applied to summarize such profiles from counseling speech; however, they may suffer from long-context forgetting and produce unverifiable hallucinations due to the overlong speech, multi-party interactions, and unstructured conversation. We therefore propose StreamProfile, a streaming framework that processes counseling speech incrementally, extracts evidence grounded in ASR transcriptions and stores it in a Hierarchical Evidence Memory, and then performs clinical reasoning through a Chain-of-Thought pipeline following the PM+ psychological intervention protocol. The final profile is synthesized strictly from this evidence, making every claim traceable. Experiments on real-world teenager counseling speech show that StreamProfile accurately generates profiles while preventing hallucination.
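The traceability constraint above — every claim in the profile must point back to transcript evidence — can be illustrated with a minimal evidence store. The structure below is a toy assumption for illustration, not the paper's Hierarchical Evidence Memory.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    utterance_id: int        # index into the ASR transcript
    text: str                # the transcribed utterance

@dataclass
class EvidenceMemory:
    """Toy evidence memory: topic -> list of supporting utterances."""
    topics: dict = field(default_factory=dict)

    def add(self, topic, utterance_id, text):
        self.topics.setdefault(topic, []).append(Evidence(utterance_id, text))

    def claim(self, topic, summary):
        """A profile claim is only emitted if grounded in stored evidence,
        so every claim carries the utterance IDs that support it."""
        support = self.topics.get(topic, [])
        if not support:
            raise ValueError(f"unsupported claim for topic {topic!r}")
        return {"claim": summary,
                "evidence": [e.utterance_id for e in support]}

mem = EvidenceMemory()
mem.add("sleep", 12, "I barely sleep four hours these days.")
profile_item = mem.claim("sleep", "Patient reports chronic sleep deficit.")
```

Refusing ungrounded claims at synthesis time is one simple way to make "every claim traceable" a hard constraint rather than a post-hoc check.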
Primary: South China University of Technology
All Institutions: South China University of Technology, Chinese Academy of Sciences, Key Laboratory of Biomedical Imaging Science and System, Shenzhen Institutes of Advanced Technology, Shenzhen Mental Health Center
The main contribution of this paper is the introduction of StreamProfile, a novel framework that integrates streaming processing, CoT reasoning, and evidence memory to generate accurate and verifiable psychological profiles from counseling sessions. This work represents a significant advancement in the application of LLMs in mental health, addressing critical challenges and demonstrating substantial improvements over existing methods.
The methodology presented in this paper is innovative, combining a streaming framework with a Chain-of-Thought (CoT) reasoning process and a Hierarchical Evidence Memory (HEM) to generate psychological profiles from counseling sessions. The approach addresses critical issues such as long-context forgetting and hallucinations in LLMs by ensuring that every claim made in the generated profiles is traceable to specific utterances from the counseling session. The use of a structured protocol (PM+) to guide the reasoning process is particularly noteworthy, as it aligns the model's outputs with clinical standards.
The experiments conducted on the Psy-Bench dataset demonstrate a rigorous evaluation of the proposed system against various LLM baselines. The results indicate significant improvements in both profile generation performance and hallucination reduction. The use of multiple evaluation metrics, including ROUGE-L, BERTScore, and subjective assessments of hallucination and consistency, provides a comprehensive understanding of the system's capabilities. The ablation studies further validate the effectiveness of the CoT and HEM components.
The paper provides detailed descriptions of the experimental setup, including the LLMs used, evaluation metrics, and dataset characteristics. However, the lack of a publicly available codebase or demo limits the reproducibility of the results. The authors mention using specific models and configurations, but without access to the implementation, it may be challenging for other researchers to replicate the findings.
One limitation is the reliance on a specific dataset (Psy-Bench) that may not generalize to other contexts or languages, as the experiments are conducted on a Chinese dataset. Additionally, while the framework addresses hallucinations effectively, the potential for misinterpretation of nuanced clinical language remains a concern. The paper also does not discuss the computational resources required for real-time processing, which could impact practical deployment.
The proposed framework has significant implications for mental health care, particularly in enhancing the efficiency and accuracy of psychological assessments. By automating the generation of structured profiles from counseling sessions, it could aid clinicians in delivering timely and informed interventions. However, ethical considerations regarding patient data privacy and the potential for over-reliance on AI in sensitive clinical contexts must be addressed.
Automatic depression detection using speech signals with acoustic and textual modalities is a promising approach for early diagnosis. Depression-related patterns exhibit sparsity in speech: diagnostically relevant features occur in specific segments rather than being uniformly distributed. However, most existing methods treat all frames equally, assuming depression-related information is uniformly distributed and thus overlooking this sparsity. To address this issue, we propose a depression detection network based on Adaptive Cross-Modal Gating (ACMG) that adaptively reassigns frame-level weights across both modalities, enabling selective attention to depression-related segments. Experimental results show that the depression detection system with ACMG outperforms baselines without it. Visualization analyses further confirm that ACMG automatically attends to clinically meaningful patterns, including low-energy acoustic segments and textual segments containing negative sentiments.
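The gating idea — frame-level weights for each modality driven by a shared cross-modal context — can be sketched as follows. The sigmoid gate, the dot-product scoring, and the shapes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_pool(frames, ctx):
    """Adaptively reweight frames by similarity to a cross-modal context.

    frames: (T, d) frame- or token-level features; ctx: (d,) shared context.
    Returns the gated pooled vector (d,) and the frame weights (T,).
    """
    scores = frames @ ctx / np.sqrt(frames.shape[1])
    gates = sigmoid(scores)              # per-frame gate in (0, 1)
    weights = gates / gates.sum()        # normalized attention over frames
    return weights @ frames, weights

rng = np.random.default_rng(0)
d = 16
acoustic = rng.normal(size=(40, d))      # stand-in acoustic frame features
textual = rng.normal(size=(12, d))       # stand-in textual token features

# A shared global context from both modalities drives both gates, so
# each modality can attend to its own sparse, depression-relevant frames.
ctx = 0.5 * (acoustic.mean(axis=0) + textual.mean(axis=0))
pooled_a, w_a = gated_pool(acoustic, ctx)
pooled_t, w_t = gated_pool(textual, ctx)
```

Uniform pooling would assign every frame weight 1/T; the gate instead concentrates mass on frames aligned with the cross-modal context, which is the sparsity argument the abstract makes.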
Primary: Shenzhen Institutes of Advanced Technology
All Institutions: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Key Laboratory of Biomedical Imaging Science and System
The main contribution of this paper is the introduction of the Adaptive Cross-Modal Gating (ACMG) mechanism for depression detection, which effectively enhances the identification of clinically relevant features in speech and text. The comprehensive analysis of the technical contribution, methodology, and significance to the field demonstrates the potential of this approach to improve automatic depression detection systems.
The proposed Adaptive Cross-Modal Gating (ACMG) mechanism is innovative in its approach to addressing the sparsity of depression-related patterns in speech and text. The dual-branch architecture effectively combines acoustic and textual modalities, leveraging pre-trained models and an adaptive gating mechanism to enhance the detection of clinically relevant features. The methodology is well-structured, with clear explanations of the ACMG mechanism, global context extraction, and feature refinement processes, demonstrating a comprehensive understanding of the problem domain.
The experiments are conducted on two relevant datasets, PDCD2025 and DAIC-WOZ, which are suitable for evaluating the effectiveness of the proposed method. The results indicate a significant improvement over baseline models, with the ACMG mechanism consistently outperforming non-ACMG systems. The use of quantitative metrics, such as accuracy and F1 score, alongside qualitative analyses, strengthens the evaluation. However, the paper lacks detailed ablation studies that could further clarify the contributions of individual components.
The paper provides a clear description of the methods and datasets used, but it lacks specific implementation details, such as hyperparameters and training procedures, which are crucial for reproducibility. The absence of a publicly available code repository or demo limits the ability of other researchers to replicate the results.
One limitation is the reliance on pre-trained models, which may not fully capture the nuances of depression-related speech and text. Additionally, the paper does not address potential biases in the datasets used, which could affect the generalizability of the findings. The lack of a comprehensive comparison with state-of-the-art methods in the field also limits the contextual understanding of the proposed approach's performance.
The implications of this research are significant, as automatic depression detection can lead to earlier diagnosis and intervention, improving mental health outcomes. The methodology could be adapted for other mental health conditions, expanding its applicability. However, ethical considerations regarding data privacy and the potential for misdiagnosis must be addressed in future work.
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
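A toy sketch of the masking-based contrastive setup described above: two masked views of each clip's patch tokens are embedded and matched with an InfoNCE loss. The stand-in "encoder" is just mean pooling, and all shapes and names are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, p=0.5):
    """Randomly zero out spectrogram patch tokens to form one view."""
    keep = rng.random(tokens.shape[0]) > p
    view = tokens.copy()
    view[~keep] = 0.0
    return view

def embed(view):
    # Stand-in encoder: mean over (masked) tokens, L2-normalized.
    z = view.mean(axis=0)
    return z / (np.linalg.norm(z) + 1e-8)

def info_nce(z1, z2, temp=0.1):
    """InfoNCE over a batch: matched views are positives, the rest negatives."""
    sims = z1 @ z2.T / temp                                  # (B, B)
    logits = sims - sims.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs) + 1e-12).mean()

# Batch of "Mel spectrogram" clips as patch tokens (T, d), each clip
# built around its own direction so views of the same clip agree.
dirs = rng.normal(size=(4, 8))
batch = [d_ + 0.1 * rng.normal(size=(24, 8)) for d_ in dirs]
z1 = np.stack([embed(mask_tokens(x)) for x in batch])
z2 = np.stack([embed(mask_tokens(x)) for x in batch])
loss = info_nce(z1, z2)
```

In the paper's setting the embeddings would come from the pretrained base model, and the downstream key detector is then just a shallow, wide MLP trained on these frozen features.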
Primary: Texas A&M University
All Institutions: Texas A&M University
The main contribution of this paper is the introduction of KeyMyna, a novel approach to music key detection that leverages masked contrastive pretraining to achieve state-of-the-art performance without complex data augmentation. This work significantly advances the field of music information retrieval by demonstrating the potential of self-supervised learning techniques in capturing pitch-sensitive representations, thereby addressing a critical challenge in the domain.
The paper introduces KeyMyna, a systematic study of self-supervised pretraining for music key detection using masked contrastive learning. The methodology is well-structured, leveraging a pre-trained model (Myna-Vertical) and shallow multi-layer perceptrons (MLPs) for key detection. The authors effectively demonstrate the advantages of their approach over traditional methods and other deep learning models, particularly in terms of pitch sensitivity and robustness to augmentations. The use of a simple contrastive learning framework with token masking is innovative and addresses the challenges of limited labeled datasets in music key detection.
The experiments are thorough, utilizing two widely recognized datasets (GiantSteps and McGill Billboard) for evaluation. The results show that KeyMyna outperforms existing methods despite using less data and simpler architectures. The paper provides a comprehensive comparison with prior work, demonstrating the effectiveness of their approach through various metrics. However, the paper could benefit from more extensive ablation studies to further validate the impact of individual components of their methodology.
The authors provide a GitHub repository with code and models, which is a positive aspect for reproducibility. Detailed hyperparameter settings and training configurations are presented, but the absence of a complete training script or environment setup instructions may hinder full reproducibility for some researchers.
The paper acknowledges limitations, such as the inability of KeyMyna to track key modulations within songs, which could affect performance in certain musical genres. Additionally, the focus on major and minor keys limits the model's applicability to more complex musical structures. Future work is suggested to address these limitations, including the exploration of moving averages for key modulation detection.
The findings of this research have significant implications for music information retrieval (MIR) applications, including playlist generation and music similarity search. By improving key detection through self-supervised learning, the work contributes to the development of more robust and efficient MIR systems. The insights gained from this study could also inform future research in music analysis and representation learning.
Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-specific encoders with hand-crafted domain features, as a bridge across this gap. In a three-phase campaign covering 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor-free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave-band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics. We also introduce reverse cross-attention, where descriptor tokens query encoder features, reducing attention operations relative to the standard formulation while remaining competitive. CKA analysis shows that descriptors substantially increase audio-MIDI transformer layer alignment, indicating representational convergence rather than simple feature concatenation. Perturbation analysis identifies high-frequency octave bands as the dominant discriminative signal. All experiments use MAESTRO v3.0.0 with an evaluation protocol controlling for composer and piece similarity.
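Reverse cross-attention, as described, lets a handful of descriptor tokens act as queries over the many encoder features, so only k fused tokens are produced instead of updating all T encoder positions. A minimal numpy sketch under assumed shapes (the projections and head structure of the actual model are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reverse_cross_attention(descriptors, features):
    """Descriptor tokens (k, d) query encoder features (T, d).

    In the standard formulation the T encoder tokens would act as
    queries and all T positions would be updated; here only the k
    descriptor queries attend over the T keys, yielding k fused
    summaries and reducing downstream work when k << T.
    """
    d = descriptors.shape[1]
    scores = descriptors @ features.T / np.sqrt(d)   # (k, T)
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    return attn @ features, attn                     # (k, d), (k, T)

rng = np.random.default_rng(0)
T, k, d = 256, 4, 32
features = rng.normal(size=(T, d))       # e.g. audio transformer outputs
descriptors = rng.normal(size=(k, d))    # e.g. projected A4/D4 descriptors
fused, attn = reverse_cross_attention(descriptors, features)
```

Each of the k output rows is a descriptor-conditioned summary of the encoder sequence, which can then be concatenated or injected into the retrieval head.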
Primary: Asociación Civil AlterMundi
All Institutions: Asociación Civil AlterMundi
The main contribution of this paper is the introduction of descriptor injection as a novel approach to improve audio-MIDI alignment, demonstrating that simple, hand-crafted features can significantly enhance cross-modal learning performance. The comprehensive methodology and rigorous experimental validation position this work as a meaningful advancement in the field of machine learning for music.
The paper presents a systematic exploration of descriptor injection for cross-modal audio-MIDI learning, employing a robust methodology that includes a three-phase experimental design with various descriptor-mechanism combinations and architectural families. The introduction of reverse cross-attention as a novel mechanism to reduce attention operations while maintaining competitive performance is a significant methodological contribution. The use of causal ablation and CKA analysis to validate the effectiveness of the descriptors adds rigor to the methodology.
The experiments are comprehensive, utilizing the MAESTRO v3.0.0 dataset and employing a structured evaluation protocol. The results demonstrate clear improvements over the baseline, with statistical significance established through multi-seed validation. The paper effectively communicates the experimental results, including detailed ablation studies and sensitivity analyses, which substantiate the claims made regarding the effectiveness of the proposed methods.
The paper provides sufficient details on the architecture, training protocols, and evaluation metrics, which supports reproducibility. However, the lack of a public repository or demo URL limits the ease of access for other researchers to replicate the findings.
The study is constrained by its use of a single dataset (MAESTRO v3.0.0), which may limit the generalizability of the findings to other musical genres or instruments. Additionally, the paper acknowledges the absence of data augmentation techniques, which could enhance robustness. The D4 descriptor's weak inference-time effect raises questions about its practical utility despite its training benefits.
The findings have potential implications for music information retrieval, automatic transcription, and musicological analysis, as they suggest that structured domain knowledge can significantly enhance cross-modal learning. The approach could be extended to other domains where modality gaps exist, making it relevant beyond music.
Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often struggle to capture fine-grained acoustic structures. Inspired by 3D Gaussian Splatting (3DGS), we introduce AudioGS, a novel visual-free framework that explicitly encodes the sound field as a set of Audio Gaussians based on spectrograms. AudioGS associates each time-frequency bin with an Audio Gaussian equipped with dual Spherical Harmonic (SH) coefficients and a decay coefficient. For a target pose, we render binaural audio by evaluating the SH field to capture directionality, incorporating geometry-guided distance attenuation and phase correction, and reconstructing the waveform. Experiments on the Replay-NVAS dataset demonstrate that AudioGS successfully captures complex spatial cues and outperforms state-of-the-art visual-dependent baselines. Specifically, AudioGS reduces the magnitude reconstruction error (MAG) by over 14% and reduces the perceptual quality metric (DPAM) by approximately 25% compared to the best performing visual-guided method.
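The rendering step above — evaluating a Spherical Harmonic field for the target direction and applying distance attenuation — can be sketched with real SH up to degree 1. The exponential decay model and the coefficients below are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

def sh_basis_deg1(direction):
    """Real spherical harmonics up to degree 1 for a unit direction."""
    x, y, z = direction / np.linalg.norm(direction)
    c0 = 0.28209479177387814          # Y_0^0 = 1 / (2 * sqrt(pi))
    c1 = 0.4886025119029199           # sqrt(3 / (4 * pi)) for Y_1^m
    return np.array([c0, c1 * y, c1 * z, c1 * x])

def render_gain(sh_coeffs, direction, distance, decay):
    """Directional gain of one Audio Gaussian at a target pose.

    sh_coeffs: (4,) degree-0/1 SH coefficients for one channel;
    decay: per-Gaussian attenuation coefficient (illustrative model).
    """
    directional = sh_basis_deg1(direction) @ sh_coeffs
    return directional * np.exp(-decay * distance)

# An omnidirectional Gaussian: only the degree-0 coefficient is set,
# so the gain depends on distance but not on listening direction.
coeffs = np.array([1.0, 0.0, 0.0, 0.0])
g_near = render_gain(coeffs, np.array([1.0, 0.0, 0.0]), 1.0, 0.5)
g_far = render_gain(coeffs, np.array([0.0, 0.0, 1.0]), 3.0, 0.5)
```

In the full model each time-frequency bin carries dual SH coefficient sets (one per ear), so evaluating the basis at two ear directions yields the binaural magnitude pair before phase correction and waveform reconstruction.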
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Institute of Cultural and Creative Industry, Shanghai Jiao Tong University
AudioGS presents a novel approach to binaural audio synthesis that effectively captures spatial cues without relying on visual data. The technical contributions, including the explicit modeling of the sound field and the integration of phase correction, represent a meaningful advancement in the field of audio processing and spatial audio synthesis.
The methodology presented in AudioGS is innovative, as it introduces a visual-free framework for synthesizing binaural audio using a set of Audio Gaussians derived from spectrograms. The use of dual Spherical Harmonic coefficients and a decay coefficient to model directional energy and distance attenuation is a significant advancement over existing methods that rely on visual priors. The explicit representation of the sound field allows for more interpretable modeling and captures complex spatial cues effectively. The integration of geometry-guided phase correction further enhances the realism of the synthesized audio, addressing limitations in phase alignment seen in previous methods.
The experiments are well-structured, utilizing the Replay-NVAS dataset to evaluate the performance of AudioGS against state-of-the-art visual-dependent methods. The quantitative metrics reported, including MAG and DPAM, provide a robust assessment of audio quality and spatial accuracy. The results demonstrate a significant improvement over existing methods, validating the effectiveness of the proposed approach. The inclusion of subjective listening tests adds depth to the evaluation, confirming the objective metrics with human judgments.
The paper provides sufficient implementation details, including training setups, loss functions, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results.
One limitation of the study is its reliance on a specific dataset (Replay-NVAS), which may not generalize to all acoustic environments. Additionally, while the paper discusses future work on extending the framework to dynamic scenes, the current implementation is limited to static environments, which may restrict its applicability in real-world scenarios.
The potential applications of AudioGS are significant, particularly in immersive technologies such as VR, AR, and XR, where high-fidelity spatial audio is crucial for enhancing user experience. The framework could also benefit fields such as gaming, virtual conferencing, and education, where realistic audio environments are increasingly important.
Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just "unsafe text spoken aloud": real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
This paper presents a comprehensive framework for audio safety protection through a novel dual-path guardrail system that effectively addresses the unique risks associated with audio inputs and outputs. The technical contributions, including the development of a robust audio risk taxonomy and extensive experimental validation, position this work as a significant advancement in the field of audio machine learning.
The methodology presented in this paper is robust, combining a novel audio risk taxonomy with a dual-path guardrail system (SoundGuard and ContentGuard) that effectively addresses the unique challenges posed by audio safety. The approach is well-structured, leveraging large-scale red teaming to identify vulnerabilities and systematically developing a comprehensive benchmark (AudioSafetyBench) that accommodates diverse threat models. The modular design allows for flexibility and efficiency in deployment, which is a significant advancement in the field.
The experiments are extensive, demonstrating the effectiveness of AudioGuard across multiple benchmarks and showing significant improvements in accuracy and latency compared to existing audio-LLM-based guardrails. The evaluation metrics are well-defined, and the results provide a clear indication of the system's performance across various scenarios, including severe voice-content compositional risks and non-speech harmful sound events.
The paper provides sufficient details regarding the implementation of the models and the training processes, which enhances reproducibility. However, the absence of a publicly available demo or project URL limits the ability for others to replicate the findings directly.
One limitation is the potential dual-use concern regarding the insights gained from the red teaming and taxonomy, which could be exploited by malicious actors. Additionally, while the paper addresses audio-native risks effectively, it may not fully encompass all possible real-world scenarios, particularly as audio technology continues to evolve.
The work has the potential to significantly enhance the safety of audio-capable AI systems, reducing harmful outputs and improving detection of impersonation and child safety risks. The modular design and policy-grounded approach could lead to more transparent and effective safety measures in various applications, including voice assistants and TTS systems.
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, National Institute of Advanced Industrial Science and Technology (AIST)
The paper presents DialogueSidon, a novel model for recovering full-duplex dialogue tracks from degraded audio, significantly advancing the field of audio processing and dialogue systems. The combination of innovative methodology and strong experimental results positions this work as a meaningful contribution to the ongoing research in speech separation and restoration.
The proposed DialogueSidon model innovatively combines a variational autoencoder (VAE) with a diffusion-based latent predictor to address the dual challenges of restoring and separating degraded monaural two-speaker dialogue audio. This joint approach is well-motivated, as it leverages self-supervised learning (SSL) features to create a compact latent space, which is crucial for effective processing of complex audio mixtures. The methodology is robust, addressing the specific challenges posed by in-the-wild audio, including overlapping speech and background noise, and introduces auxiliary latent predictions to mitigate permutation ambiguity, showcasing a thoughtful design that enhances model performance.
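The permutation ambiguity mentioned above is a standard obstacle in two-speaker separation: the model's two output channels carry no inherent speaker ordering, so a naive loss can penalize a perfect separation that happens to be swapped. A generic permutation-invariant loss illustrates the problem the auxiliary latent predictions are designed to mitigate (a textbook PIT sketch, not the paper's actual objective):

```python
import itertools
import numpy as np

def permutation_invariant_loss(pred, target):
    """Score every speaker ordering and keep the cheapest one, so a
    correct separation with swapped channels is not penalized. Generic
    PIT sketch of the permutation-ambiguity issue; the paper's actual
    loss operates on latent representations and may differ."""
    best = float("inf")
    for perm in itertools.permutations(range(len(target))):
        loss = sum(np.mean((pred[i] - target[p]) ** 2)
                   for i, p in enumerate(perm))
        best = min(best, loss)
    return best

# Predictions that match the targets in swapped order still score 0.
a, b = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
print(permutation_invariant_loss([b, a], [a, b]))  # 0.0
```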
The experiments are comprehensive, utilizing multiple datasets (English, multilingual, and in-the-wild) to evaluate the model's performance across various conditions. The use of both objective metrics (WER, p-CER, speaker similarity) and subjective assessments (MOS) provides a well-rounded evaluation of the model's effectiveness. The results demonstrate significant improvements in intelligibility and separation quality over baseline methods, with a notable reduction in WER and high subjective quality ratings, indicating that the model not only performs well technically but is also preferred by human listeners.
The paper provides sufficient detail regarding the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. The authors mention the use of specific datasets and the training process on multiple GPUs, which is helpful for others looking to replicate the study. However, the lack of access to the datasets used for training could limit full reproducibility for some researchers.
One limitation is that the model is primarily evaluated on two-speaker dialogue, which may not generalize well to scenarios involving more speakers or different types of dialogue interactions. Additionally, while the model shows improved performance, the reliance on specific SSL features may limit its applicability to other audio domains or languages not covered in the training data.
The potential applications of DialogueSidon are significant, particularly in enhancing the quality of conversational AI systems and improving the accessibility of spoken dialogue data for research and development. By enabling the recovery of clean speaker-wise audio from in-the-wild recordings, this work could facilitate advancements in natural language processing, speech recognition, and human-computer interaction.
We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate audio tokens for four target tracks. The generated tokens are decoded back to waveforms through the codec decoder. Evaluation on the MUSDB18-HQ benchmark shows that our generative approach achieves perceptual quality approaching state-of-the-art discriminative methods, while attaining the highest NISQA score on the vocals track. Ablation studies confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.
Primary: Qwen Applications Business Group of Alibaba
All Institutions: Qwen Applications Business Group of Alibaba
The paper presents a novel generative framework for multi-track music source separation that reformulates the task as autoregressive token prediction, achieving competitive performance against state-of-the-art methods. The methodology is innovative, and the results demonstrate significant potential for advancing the field of audio processing.
The proposed methodology introduces a novel generative framework for multi-track music source separation (MSS) that leverages a combination of a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model. This approach effectively reformulates the MSS task into a discrete token generation problem, which is a significant departure from traditional continuous signal estimation methods. The use of residual vector quantization (RVQ) to represent target tracks as interleaved acoustic and semantic tokens is innovative and allows for autoregressive generation of multiple tracks in a single run, enhancing the model's ability to capture cross-track dependencies. The architecture is well-structured, and the integration of a language model for audio token generation is a promising direction that could influence future research in audio processing.
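The interleaved multi-codebook prediction and sequential cross-track generation can be illustrated with a toy token-ordering sketch (the frame-major, codebook-minor ordering and the track-separator token are assumptions for illustration, not the paper's exact scheme):

```python
def interleave_codebooks(frames):
    """Flatten per-frame RVQ codes [[c1..cK] per frame] into one
    autoregressive token stream: frame-major, codebook-minor."""
    return [code for frame in frames for code in frame]

def sequential_tracks(track_frames, sep_token=-1):
    """Concatenate the tracks one after another (sequential cross-track
    generation), separated by a hypothetical track-separator token so
    later tracks can condition on earlier ones."""
    stream = []
    for frames in track_frames:
        stream.extend(interleave_codebooks(frames))
        stream.append(sep_token)
    return stream

# Two frames x two codebooks per track, two tracks.
vocals = [[11, 12], [13, 14]]
drums = [[21, 22], [23, 24]]
print(sequential_tracks([vocals, drums]))
# [11, 12, 13, 14, -1, 21, 22, 23, 24, -1]
```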
The experiments are conducted on the MUSDB18-HQ benchmark, which is a recognized dataset for evaluating music source separation methods. The authors provide comprehensive evaluations using perceptual metrics like ViSQOL, DNSMOS, and NISQA, demonstrating competitive performance against state-of-the-art discriminative methods. The results indicate that the proposed generative approach achieves perceptual quality comparable to existing methods, particularly excelling in vocal track separation. The ablation studies further validate the effectiveness of key components of the framework, such as the learnable Conformer encoder and the benefits of sequential cross-track generation.
The paper provides detailed implementation details, including model architecture, training configurations, and evaluation metrics, which are essential for reproducibility. However, the reliance on pseudo-labels generated by a baseline model (BS-RoFormer) raises concerns about the quality and reliability of the training data, which could affect reproducibility in practice. The authors do not provide a public code repository, which limits the ability for others to replicate the results directly.
The paper acknowledges several limitations, including challenges in separating percussive sources with sharp transients, which are difficult for the autoregressive generation paradigm. The reliance on pseudo-labels may introduce biases and limit the performance upper bound. Additionally, the dual-path codec architecture with multiple layers of RVQ can lead to cumulative errors, affecting the quality of the final output.
The proposed framework has significant implications for various applications in music technology, including music remixing, transcription, and karaoke generation. By advancing the state of the art in multi-track music source separation, this research could enhance user experiences in music production and accessibility for hearing-impaired individuals. The integration of language models into audio processing also opens avenues for further exploration in multimodal AI systems.
Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcription quality and question answering performance. In practice, stronger attacks often come at the cost of degraded utility. To study this trade-off, we revisit existing attacks by varying their perturbation coverage in the frequency domain, from partial-band to full-band, and find that broader frequency coverage does not necessarily improve jailbreak performance, while utility consistently deteriorates. This suggests that concentrating perturbation on a subset of bands can yield a better attack-utility trade-off than indiscriminate full-band coverage. Based on this insight, we propose GRM, a utility-aware frequency-selective jailbreak framework. It ranks Mel bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns a reusable universal perturbation under a semantic-preservation objective. Experiments on four representative ALLMs show that GRM achieves an average Jailbreak Success Rate (JSR) of 88.46% while providing a better attack-utility trade-off than representative baselines. These results highlight the potential of frequency-selective perturbation for better balancing attack effectiveness and utility preservation in audio jailbreak. Content Warning: This paper includes harmful query examples and unsafe model responses.
Primary: Sun Yat-Sen University
All Institutions: Sun Yat-Sen University
The main contribution of this paper is the introduction of GRM, a utility-aware frequency-selective jailbreak framework for audio LLMs, which balances attack effectiveness and utility preservation. This work represents a meaningful advancement in the understanding and mitigation of vulnerabilities in audio-based AI systems, highlighting the importance of considering utility in adversarial settings.
The proposed method, GRM (Gradient Ratio Masking), introduces a novel framework for audio jailbreak attacks that emphasizes a utility-aware approach. By focusing on frequency-selective perturbation, the authors demonstrate a systematic method for identifying key Mel frequency bands that contribute to jailbreak effectiveness while minimizing utility degradation. The dual-gradient scoring mechanism for band selection is a unique aspect that enhances the attack-utility trade-off. The methodology is well-structured, with clear definitions and a comprehensive explanation of the optimization process, making it a significant advancement in the field of audio adversarial attacks.
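The dual-gradient band scoring can be sketched as a ratio of attack-gradient magnitude to utility-gradient magnitude per Mel band, keeping only the top-k bands for perturbation (function names, the exact scoring form, and the toy values below are assumptions; the paper's implementation may differ):

```python
import numpy as np

def select_bands(attack_grad, utility_grad, k, eps=1e-8):
    """Rank Mel bands by attack contribution relative to utility
    sensitivity and keep the top-k (hypothetical sketch of the
    dual-gradient scoring)."""
    score = np.abs(attack_grad) / (np.abs(utility_grad) + eps)
    return np.argsort(score)[::-1][:k]

def band_mask(n_bands, selected):
    """Binary mask restricting the perturbation to the selected bands."""
    mask = np.zeros(n_bands)
    mask[selected] = 1.0
    return mask

# Toy example: 8 Mel bands, perturb the 3 most attack-efficient ones.
attack = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.3, 0.6])
utility = np.array([0.1, 0.8, 0.2, 0.9, 0.1, 0.7, 0.6, 0.1])
sel = select_bands(attack, utility, k=3)
print(sorted(sel.tolist()))  # [0, 4, 7]
```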
The experiments are robust, involving multiple representative ALLMs and a thorough evaluation of the attack's effectiveness through metrics such as Jailbreak Success Rate (JSR), Word Error Rate (WER), and Response Quality Score (RQS). The results indicate that GRM achieves a high average JSR of 88.46% while maintaining better utility preservation compared to existing baselines. The ablation studies provide insight into the contributions of different components of the GRM framework, further validating the effectiveness of the proposed method.
The paper provides detailed implementation details, including the experimental setup, datasets, model configurations, and evaluation protocols. However, the lack of a publicly accessible code repository or demo URL limits the reproducibility of the results. The authors mention using specific models and datasets, which could be challenging for other researchers to replicate without access to the same resources.
The study acknowledges several limitations, including potential bias in the evaluation metrics due to reliance on LLM-based judges and the lack of validation in real-world environments. Additionally, the method's effectiveness is primarily model-specific, with limited cross-model transferability observed. These factors may restrict the generalizability of the findings.
The implications of this research are significant, particularly in the context of audio security and the safety of audio large language models (ALLMs). By improving the attack-utility trade-off, the findings could inform the development of more robust defenses against audio jailbreak attacks. The methodology could also be applied to enhance the safety and reliability of audio-based AI systems in various applications, including voice assistants and interactive systems.
Audio super-resolution aims to recover missing high-frequency details from bandwidth-limited low-resolution audio, thereby improving the naturalness and perceptual quality of the reconstructed signal. However, most existing methods directly operate in the waveform or time-frequency domain, which not only involves high-dimensional generation spaces but is also largely limited to speech tasks, leaving substantial room for improvement on more complex audio types such as sound effects and music. To mitigate these limitations, we introduce LatentFlowSR, a new audio super-resolution approach that leverages conditional flow matching (CFM) within a latent representation space. Specifically, we first train a noise-robust autoencoder, which encodes low-resolution audio into a continuous latent space. Conditioned on the low-resolution latent representation, a CFM mechanism progressively generates the corresponding high-resolution latent representation from a Gaussian prior with a one-step ordinary differential equation (ODE) solver. The resulting high-resolution latent representation is then decoded by the pretrained autoencoder to reconstruct the high-resolution audio. Experimental results demonstrate that LatentFlowSR consistently outperforms baseline methods across various audio types and super-resolution settings. These results indicate that the proposed method possesses strong high-frequency reconstruction capability and robust generalization performance, providing compelling evidence for the effectiveness of latent-space modeling in audio super-resolution. All relevant code will be made publicly available upon completion of the paper review process.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China
The main contribution of this paper is the introduction of LatentFlowSR, an innovative audio super-resolution method that effectively utilizes latent-space modeling and conditional flow matching to achieve high-fidelity audio reconstruction. This work represents a significant advancement in the field of audio processing, particularly in its ability to handle diverse audio types and improve computational efficiency.
The methodology proposed in LatentFlowSR is innovative, leveraging a noise-robust autoencoder to encode low-resolution audio into a latent space, followed by a conditional flow matching (CFM) mechanism to generate high-resolution audio. The use of an ODE solver for generating high-resolution latent representations is a novel approach that enhances computational efficiency. The architecture incorporates a U-Net style for estimating the velocity field, which is well-suited for capturing both local and global audio features. The integration of noise robustness during training further strengthens the model's performance in real-world applications.
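The one-step ODE sampling that makes inference cheap can be illustrated in a few lines: conditional flow matching trains a velocity field whose regression target along the linear path is constant, so a single Euler step from the Gaussian prior already yields a sample (a toy sketch with an oracle velocity, not the paper's U-Net estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_target_velocity(x0, x1):
    """For the linear probability path x_t = (1-t)*x0 + t*x1, the
    regression target for the velocity network is constant: x1 - x0."""
    return x1 - x0

def one_step_ode_sample(velocity_fn, cond, latent_dim, rng):
    """Single Euler step from a Gaussian prior, as used at inference:
    z1 = z0 + v(z0, t=0, cond). A sketch only."""
    z0 = rng.standard_normal(latent_dim)
    return z0 + velocity_fn(z0, 0.0, cond)

# Toy 'oracle' velocity that points straight at the conditioning latent,
# so one Euler step lands exactly on it.
target = np.array([1.0, -2.0, 0.5])
v = lambda z, t, cond: cond - z
z1 = one_step_ode_sample(v, target, 3, rng)
print(np.allclose(z1, target))  # True
```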
The experimental evaluation is comprehensive, utilizing diverse datasets that include speech, sound effects, and music, which allows for a thorough assessment of the model's capabilities. The results demonstrate significant improvements over baseline methods in both objective metrics (LSD, LSD-HF, ViSQOL) and subjective evaluations (MOS), indicating strong performance across various audio types and degradation levels. The paper also includes a detailed analysis of computational complexity, showcasing the efficiency of the LatentFlowSR model.
The paper outlines the implementation details and training strategies clearly, which aids in reproducibility. The authors mention the use of specific datasets, training steps, and optimization techniques, which are essential for others to replicate their work. However, the lack of a public code repository at the time of review may hinder immediate reproducibility.
One limitation is the reliance on a specific architecture and training strategy, which may not generalize well to all audio super-resolution tasks. Additionally, while the model shows strong performance on the tested datasets, further validation on more diverse and challenging datasets would be beneficial to fully assess its generalization capabilities.
The proposed LatentFlowSR has significant implications for various applications in audio processing, including speech synthesis, music restoration, and sound effects enhancement. Its ability to recover high-frequency details can improve the quality of audio in consumer products, entertainment, and communication technologies. The methodology could also inspire further research into latent-space modeling in other domains of machine learning.
Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding, we introduce Audio-Visual Nexus Alignment to select temporally consistent visual evidence for the audio timeline, and a cross-gated modulation mechanism that applies multiplicative control from aligned visual and textual cues to the audio latents, suppressing unsupported components rather than injecting them. Finally, we train with a dynamic modality-masking curriculum that exposes the model to text-only, visual-only, joint multimodal, and mixture-guided settings, enabling robust inference under missing modalities without training separate models. Experiments on the MUSIC benchmark show that MAGE supports effective multimodal-guided music generation and targeted editing, achieving competitive quality while offering a lightweight and flexible interface tailored to practical music workflows.
Primary: Princeton University
All Institutions: Princeton University, Google, University of North Carolina at Charlotte
The main contribution of this paper is the introduction of MAGE, a modality-agnostic framework for multimodal music generation and editing that effectively combines various conditioning inputs within a single continuous latent formulation. This work significantly advances the field of audio processing by addressing the complexities of multimodal music creation and providing a flexible, unified approach that enhances both generation and editing capabilities.
The methodology presented in MAGE is robust and innovative, employing a Controlled Multimodal FluxFormer that integrates audio generation and editing in a unified framework. The use of Audio-Visual Nexus Alignment and cross-gated modulation enhances cross-modal grounding, addressing significant challenges in multimodal music generation. The dynamic modality-masking curriculum is a noteworthy approach that allows the model to adapt to various input conditions without requiring separate training for each modality, which is a significant advancement in the field.
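The cross-gated modulation idea, multiplicative suppression rather than additive injection, can be sketched as a sigmoid gate computed from the conditioning cues and applied element-wise to the audio latents (the projection shapes and parameters here are hypothetical illustrations, not the paper's architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_gated_modulation(audio, visual, text, Wv, Wt, b):
    """Aligned visual/text cues produce a per-dimension gate in (0, 1)
    that scales the audio latents, suppressing unsupported components
    instead of adding content. Wv, Wt, b are hypothetical trainable
    parameters."""
    gate = sigmoid(visual @ Wv + text @ Wt + b)
    return audio * gate

# Toy: a strongly negative gate logit drives that latent dim toward 0,
# while a zero logit leaves it half-open.
d = 4
audio = np.ones(d)
out = cross_gated_modulation(audio, np.zeros(2), np.zeros(2),
                             np.zeros((2, d)), np.zeros((2, d)),
                             np.array([10.0, -10.0, 0.0, 0.0]))
print(np.round(out, 3))
```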
The experiments conducted on the MUSIC benchmark are comprehensive and well-structured, demonstrating the effectiveness of MAGE in both multimodal-guided music generation and targeted editing. The evaluation metrics used, including SDR, SIR, SAR, FAD, and CLAP scores, provide a thorough assessment of the model's performance across different tasks, ensuring that both signal fidelity and semantic alignment are addressed.
The paper provides detailed implementation information, including the architecture of the model, training procedures, and the datasets used. However, the absence of a publicly available code repository or demo limits the reproducibility of the results, as other researchers cannot easily validate or build upon the work.
While the proposed framework is promising, it may still struggle with highly complex mixtures where the separation of overlapping sources is particularly challenging. Additionally, the reliance on specific datasets like MUSIC may limit the generalizability of the findings to other music genres or contexts.
The implications of MAGE extend beyond academic research, potentially influencing practical applications in music production, interactive audio editing, and creative tools for musicians. The ability to generate and edit music based on multimodal inputs could revolutionize how music is created and manipulated, making it more accessible to non-experts.
Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio captioning tasks, including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit the same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
Primary: Japan Advanced Institute of Science and Technology
All Institutions: Japan Advanced Institute of Science and Technology
This paper presents a novel method for mitigating hallucinations in auditory large language models through the innovative use of noise as contextual guidance. The comprehensive methodology and robust experimental results indicate a meaningful contribution to the field of audio understanding and generative modeling, addressing a critical challenge in the deployment of ALLMs.
The proposed Noise-Aware In-Context Learning (NAICL) method introduces an innovative approach to mitigate hallucinations in Auditory Large Language Models (ALLMs) by utilizing a structured noise prior library. This method effectively guides the model to adopt more conservative outputs when acoustic evidence is insufficient, which is a significant departure from traditional fine-tuning approaches that are computationally expensive. The methodology is well-structured, involving a detailed process of dataset filtering, noise retrieval, and contextual integration, which enhances the interpretability and effectiveness of the model.
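The retrieval step can be sketched as nearest-neighbor search over a library of noise-prior embeddings: the top matches for the input audio's embedding are prepended as in-context priors (the paper uses BEATs as the acoustic encoder; cosine similarity as the retrieval metric is an assumption here):

```python
import numpy as np

def retrieve_noise_priors(query_emb, library_embs, top_k=3):
    """Return the indices of the noise examples whose embeddings are
    most similar (cosine similarity) to the input audio's embedding.
    Sketch only; the paper's retrieval details may differ."""
    q = query_emb / np.linalg.norm(query_emb)
    lib = library_embs / np.linalg.norm(library_embs, axis=1, keepdims=True)
    sims = lib @ q
    return np.argsort(sims)[::-1][:top_k]

# Toy library of four noise-prior embeddings.
library = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
print(retrieve_noise_priors(query, library, top_k=2).tolist())  # [0, 2]
```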
The experiments are comprehensive, utilizing the newly constructed Clotho-1K benchmark to evaluate the performance of various ALLMs. The results demonstrate a significant reduction in hallucination rates, providing strong empirical support for the effectiveness of the NAICL method. The inclusion of an ablation study further strengthens the findings by analyzing different configurations and their impacts on performance, showcasing the robustness of the approach.
The paper provides sufficient implementation details, including the use of a specific acoustic encoder (BEATs) and clear descriptions of the retrieval process. The availability of code on GitHub enhances reproducibility, although further details on hyperparameter settings and training procedures would be beneficial for complete replication.
One limitation is the reliance on the Clotho dataset, which may not encompass the full diversity of real-world audio scenarios, potentially limiting the generalizability of the findings. Additionally, while the method shows promise in reducing hallucinations, it may introduce its own biases by overly constraining the model's outputs in uncertain contexts.
The implications of this research are significant for the development of reliable audio understanding systems, particularly in applications requiring high accuracy in audio captioning and interpretation. By addressing hallucination issues, this work could enhance the deployment of ALLMs in real-world scenarios, such as assistive technologies and automated content generation, thereby improving user trust and system performance.
Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an alternative is to expose discrete phoneme sequences to the LLM. Using the same encoder and LLM backbones, we compare phoneme-based and vanilla projector-based interfaces in high-resource English and low-resource Tatar. We also propose a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues for phoneme-to-grapheme generation. On LibriSpeech, the phoneme-based interface is competitive with the vanilla projector, and the BPE-phoneme interface yields further gains. On Tatar, the phoneme-based interface substantially outperforms the vanilla projector. We further find that phoneme supervision yields a phoneme-informed hybrid interface that is stronger than the vanilla projector.
Primary: Tsinghua University
All Institutions: Tsinghua University, TasiTech Co., Ltd., Xinjiang University
The main contribution of this paper is the comparative analysis of phoneme-based and projector-based interfaces for LLM-integrated ASR, revealing that phoneme-based approaches can significantly enhance performance, especially in low-resource settings. This work advances the understanding of speech-language interfaces and provides a foundation for future innovations in ASR systems.
The paper presents a systematic comparison of two speech-language interfaces, projector-based and phoneme-based, for integrating large language models (LLMs) with automatic speech recognition (ASR). The methodology is robust, utilizing controlled backbones and extensive experiments across high-resource (LibriSpeech) and low-resource (Tatar) settings. The introduction of a BPE-phoneme interface is particularly innovative, as it combines phoneme sequences with boundary-awareness, enhancing the model's performance. The two-stage training process for both interfaces is well-defined and leverages phoneme supervision effectively.
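The BPE-phoneme idea, grouping frequent local phoneme patterns without losing word boundaries, can be sketched with a minimal merge loop that operates word-by-word, so merges never span a boundary (the phoneme inventory and frequency-based merge scoring below are illustrative assumptions, not the paper's exact procedure):

```python
from collections import Counter

def bpe_phoneme_merges(words, num_merges):
    """Learn BPE-style merges over phoneme sequences, one word at a
    time, so frequent local phoneme patterns are grouped while the
    explicit word boundary is never crossed."""
    seqs = [list(w) for w in words]  # each word: a list of phonemes
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

# The frequent pattern 'L OW' is merged; word boundaries stay intact.
words = [["HH", "AH", "L", "OW"], ["L", "OW"], ["L", "OW"]]
merges, seqs = bpe_phoneme_merges(words, 1)
print(merges, seqs)
# [('L', 'OW')] [['HH', 'AH', 'LOW'], ['LOW'], ['LOW']]
```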
The experiments are comprehensive, covering a variety of configurations and datasets. The results demonstrate that the phoneme-based interface significantly outperforms the vanilla projector, especially in low-resource scenarios, which is a critical finding for the field. The paper provides clear metrics (Word Error Rate - WER) and contextualizes results against recent baselines, showcasing the competitive performance of the proposed methods.
The paper outlines the experimental setup, including details on datasets, model architectures, and training procedures. However, the absence of a public repository or demo limits reproducibility. The authors mention the use of specific models and configurations but do not provide code or data access, which is essential for full reproducibility in machine learning research.
One limitation is the lack of exploration into the scalability of the proposed interfaces beyond the tested languages and datasets. Additionally, while the BPE-phoneme interface shows promise, its effectiveness in other languages or dialects has not been evaluated. The paper also does not address potential biases in the datasets used, which could impact generalizability.
The findings have significant implications for ASR systems, particularly in low-resource languages, where traditional methods often struggle. The insights gained from this study could inform future research and development in multilingual ASR, potentially leading to more inclusive and accessible speech technologies.
We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic → coarse acoustic → fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization.
Primary: Zhejiang Lab
All Institutions: Zhejiang Lab, University of Science and Technology of China, Huawei Technologies Co., Ltd.
AccompGen presents a novel approach to vocal accompaniment generation through a hierarchical autoregressive model that leverages dual-rate codec tokenization and modern Transformer techniques. This work significantly advances the field of audio generation by providing a robust framework for creating high-quality instrumental music that complements vocal performances.
The methodology presented in AccompGen is innovative, particularly with its dual-rate codec tokenization and hierarchical autoregressive architecture. The use of HuBERT and EnCodec for semantic and acoustic tokenization respectively allows for a nuanced representation of both vocals and instrumentals, which is critical for generating coherent music. The three-stage approach effectively decomposes the generation task, allowing for more controlled and precise outputs. The incorporation of modern Transformer design choices enhances the model's performance and stability, making it a significant advancement over previous methods.
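Two mechanics from the description above can be sketched in a few lines: converting between the 50 Hz semantic and 75 Hz acoustic frame rates over a shared duration, and flattening multi-codebook acoustic codes into a single token stream. The frame rates follow the paper; the frame-major interleaving pattern itself is a common convention and an assumption here, not necessarily AccompGen's exact scheme.

```python
def acoustic_len_for(n_sem, sem_hz=50, ac_hz=75):
    """Number of acoustic frames spanning the same duration as n_sem
    semantic frames (HuBERT at 50 Hz, EnCodec at 75 Hz, per the paper)."""
    return round(n_sem / sem_hz * ac_hz)

def interleave_codebooks(codes):
    """Flatten per-codebook token lists (n_codebooks lists of n_frames
    tokens each) into one stream, frame-major: all codebooks for frame 0,
    then all codebooks for frame 1, and so on (illustrative pattern)."""
    n_frames = len(codes[0])
    return [codes[q][t] for t in range(n_frames) for q in range(len(codes))]
```

Rate-independent modeling means the two streams never need token-for-token alignment; only their durations must agree, which the 2:3 rate ratio makes exact for any even number of semantic frames.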
The experiments conducted on the MUSDB18 dataset demonstrate the effectiveness of AccompGen, achieving a Fréchet Audio Distance (FAD) score that matches state-of-the-art systems while using significantly fewer parameters. This is a strong indicator of the model's efficiency and effectiveness. However, the paper could benefit from more extensive qualitative evaluations, such as user studies or subjective listening tests, to complement the objective metrics provided.
The paper provides a detailed description of the model architecture, training configurations, and data preprocessing steps, which aids in reproducibility. However, the lack of a publicly available code repository or demo limits the ability for other researchers to replicate the results directly. Providing access to the trained models or code would greatly enhance reproducibility.
One limitation is the reliance on specific datasets (MUSDB18 and FMA-Large) for training and evaluation, which may not fully represent the diversity of vocal styles and musical genres. Additionally, the model's performance in real-world applications, where vocal inputs may vary significantly in quality and style, remains untested. The paper also does not address potential biases in the training data that could affect the model's outputs.
The ability to generate instrumental accompaniments from vocal inputs has significant implications for music creation, democratizing music production and enabling non-musicians to create personalized music. This technology could be applied in various fields, including music education, entertainment, and therapy. However, ethical considerations regarding copyright and the potential for misuse in generating music that mimics existing artists should be addressed.
Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.
Primary: Télécom SudParis
All Institutions: Télécom SudParis, Polytechnique de Paris
This paper presents a significant advancement in the field of abusive speech detection by proposing a novel approach that leverages few-shot learning and contrastive audio-text representations. The methodology and results contribute valuable insights into the challenges of detecting abusive speech in low-resource languages, highlighting the potential for effective cross-lingual transfer and adaptation.
The methodology presented in this paper is innovative in its application of Contrastive Language-Audio Pre-training (CLAP) for abusive speech detection directly from audio, bypassing traditional ASR pipelines. The authors effectively leverage few-shot learning techniques to adapt the model to low-resource languages, which is crucial given the linguistic diversity in the target application. The use of projection-only adaptation versus projection+fine-tuning is a thoughtful approach that allows for exploration of trade-offs in performance and computational efficiency. The research questions are well-defined, guiding the exploration of the model's capabilities across various languages and adaptation scenarios.
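The projection-only adaptation described above can be sketched as a small trainable head on top of frozen CLAP audio embeddings, trained with a supervised contrastive objective. This is a minimal sketch in the spirit of the paper's setup; the head architecture, dimensions, and temperature are assumptions, not the authors' exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Lightweight trainable head over frozen CLAP audio embeddings
    (projection-only adaptation); dimensions are illustrative."""
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                 nn.Linear(in_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def supcon_loss(z, labels, tau=0.07):
    """Supervised contrastive loss over L2-normalized embeddings z:
    pull same-label samples together, push different labels apart."""
    sim = z @ z.T / tau
    sim.fill_diagonal_(float('-inf'))            # exclude self-pairs
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos.fill_diagonal_(False)
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    counts = pos.sum(1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos, 0).sum(1) / counts)
    return per_anchor[pos.any(1)].mean()         # anchors with >=1 positive
```

In a few-shot regime only the head's parameters are updated, which keeps the adaptation cheap and, as the paper notes, still competitive with fully supervised systems.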
The experiments are comprehensive, utilizing the ADIMA dataset to evaluate the proposed methods across multiple languages and settings. The authors provide detailed results for both few-shot and zero-shot conditions, with a clear focus on macro-F1 scores as the primary evaluation metric. The analysis of language-specific performance and the leave-one-language-out (LOLO) approach adds depth to the evaluation, revealing insights into the model's transferability and robustness across different linguistic contexts. However, the reliance on a single dataset may limit the generalizability of the findings.
The paper includes sufficient implementation details, such as the use of PyTorch and HuggingFace for model training, as well as fixed random seeds for reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which other researchers can replicate the results. Including such resources would enhance the paper's impact and facilitate further exploration of the proposed methods.
The authors acknowledge several limitations, including the focus on a single dataset, which may not capture the full diversity of abusive speech across different contexts and languages. Additionally, the few-shot results may be sensitive to support-set composition and optimization choices, particularly in very low-shot regimes. The paper also lacks a dedicated fairness analysis, which is critical given the cultural sensitivity surrounding abusive language.
The implications of this research are significant, particularly in the context of moderating abusive speech in multilingual and low-resource settings. By developing a model that can effectively detect abusive speech directly from audio, the work addresses a pressing societal need in the age of voice-based social media. The findings could inform the development of more robust moderation tools that respect linguistic diversity and cultural nuances.
Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes. The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes. Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.
Primary: Ocean University of China
All Institutions: Ocean University of China
The main contribution of this paper is the introduction of a novel method for generating audio-visual pseudo-fakes that enhances the generalizability of video deepfake detection models. This work represents a meaningful advancement in the field, addressing a critical challenge in deepfake detection by leveraging authentic data to create diverse training samples that better reflect real-world scenarios.
The proposed methodology introduces a novel self-generated Audio-Visual Pseudo-Fake (AVPF) strategy that enhances the generalizability of video deepfake detection by simulating both inter- and intra-modality inconsistencies. The two key strategies, Audio-Visual Self-Blending (AVSB) and Audio-Visual Self-Splicing (AVSS), are well-conceived, leveraging authentic data to create pseudo-fake samples that reflect real-world deepfake characteristics. The approach is straightforward yet effective, relying solely on authentic samples, which is a significant departure from existing methods that require real deepfake data. The methodology is clearly articulated, with detailed descriptions of the processes involved in generating pseudo-fakes, making it replicable for future research.
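The splicing idea behind AVSS can be illustrated with a minimal sketch: take an authentic audio track, replace a random segment with material from a second authentic clip, and label the result fake. This is only a schematic of the intra-audio inconsistency the paper targets; the segment-length range, boundary handling, and function name are assumptions, not the paper's exact AVSS procedure (which also covers the visual stream).

```python
import random

def audio_self_splice(audio_a, audio_b, sr=16000, min_s=0.5, max_s=1.5, rng=None):
    """Build a pseudo-fake track by splicing a random segment of authentic
    clip B into authentic clip A. Returns the spliced samples and the
    (start, end) region, which carries the 'fake' label."""
    rng = rng or random.Random()
    seg_len = int(rng.uniform(min_s, max_s) * sr)
    seg_len = min(seg_len, len(audio_a) - 1, len(audio_b))
    start = rng.randrange(len(audio_a) - seg_len + 1)
    donor = rng.randrange(len(audio_b) - seg_len + 1)
    out = list(audio_a)
    out[start:start + seg_len] = audio_b[donor:donor + seg_len]
    return out, (start, start + seg_len)
```

Because both inputs are authentic, no real deepfakes are ever needed; the detector learns from the synthetic inconsistency at the splice boundary rather than from generator-specific artifacts.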
The experiments are extensive, covering multiple standard datasets, including FakeAVCeleb, AV-Deepfake1M, AVLips, and TalkingHeadBench. The paper reports an average performance improvement of up to 7.4%, which is a significant enhancement over existing methods. The results are well-presented, with comparisons to state-of-the-art methods, demonstrating the effectiveness of the proposed approach. The ablation studies provide valuable insights into the contributions of each component of the methodology, reinforcing the robustness of the findings.
The paper provides sufficient implementation details, including the use of specific datasets, the architecture of the model, and the training parameters. However, the lack of a publicly available code repository or demo URL limits reproducibility. Future work should consider sharing code and datasets to facilitate validation of the results by the research community.
One limitation noted is that while the method improves detection generalizability, it does not address the localization of forgery within videos. This could be a significant drawback for practical applications where identifying the specific manipulated segments is crucial. Additionally, the reliance on authentic samples may still limit the diversity of training data, as the method may not cover all possible deepfake manipulation techniques.
The proposed method has significant implications for the field of multimedia forensics and security, particularly in combating the growing threat of deepfake technology. By improving detection capabilities, the research contributes to safeguarding authenticity in various domains, including journalism, social media, and legal contexts. The approach also opens avenues for further research into multi-modal deepfake detection strategies.
QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross-scenario generalization, and large-scale analysis. To address this gap, we first introduce the QoS-QoE Translation dataset, a source-grounded dataset of structured QoS-QoE relationships from the multimedia literature, with a focus on video streaming related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS-QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS-QoE translation, both before and after supervised fine-tuning on our dataset, and show strong performance on both continuous-value and discrete-label prediction in bidirectional translation, both QoS-to-QoE and QoE-to-QoS. Our dataset provides a foundation for benchmarking LLMs in QoS-QoE translation and for supporting future LLM-based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at https://yyu6969.github.io/qos-qoe-translation-page/, for full reproducibility and open access.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, University of Massachusetts Amherst
The main contribution of this paper is the introduction of the QoS-QoE Translation dataset, which provides a structured resource for understanding the relationship between Quality of Service and Quality of Experience in multimedia systems, alongside a robust methodology for its construction and evaluation. This work significantly advances the field by enabling systematic reuse and benchmarking of LLMs in QoS-QoE translation tasks, addressing a critical gap in existing research.
The methodology presented in the paper is robust, involving a well-structured pipeline for dataset construction that includes paper curation, relationship extraction, and iterative evaluation. The use of large language models (LLMs) for both extraction and evaluation is innovative, leveraging their capabilities for structured information retrieval from complex academic texts. The iterative review process enhances the reliability of the dataset, addressing common pitfalls in automated extraction methods. However, while the approach is systematic, it would benefit from a clearer explanation of the specific prompts used for LLM extraction and how they were tailored to the nuances of QoS-QoE relationships.
The experimental evaluation is thorough, demonstrating the effectiveness of the proposed dataset through supervised fine-tuning of various LLMs. The results indicate significant performance improvements across multiple models, showcasing the dataset's utility for both continuous and discrete prediction tasks. The metrics used (MAPE, Accuracy, Macro-F1) are appropriate for the tasks at hand, and the clear delineation of results before and after fine-tuning provides a compelling narrative of the dataset's impact. However, the experiments could be strengthened by including comparisons to baseline models or alternative approaches to highlight the advantages of the proposed method.
The paper emphasizes reproducibility by providing access to the dataset and code, which is a commendable practice in machine learning research. The structured JSON format of the dataset, along with detailed descriptions of the data extraction process, supports reproducibility. However, the paper could enhance reproducibility further by including more detailed information about the training configurations and hyperparameters used during the fine-tuning of the LLMs.
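A source-grounded record of the kind described (relationship plus parameter definitions, evidence, and context) might look like the following. All field names and values here are hypothetical illustrations of the structured-JSON idea, not the dataset's actual schema.

```python
import json

# Hypothetical QoS-QoE record; every field name and value is an
# illustrative placeholder, not the dataset's real schema.
record = {
    "source_paper": "doi:10.1145/example",
    "qos_parameter": {"name": "rebuffering_ratio", "unit": "%", "range": [0, 100]},
    "qoe_metric": {"name": "MOS", "scale": [1, 5]},
    "relationship": {
        "form": "logarithmic",
        "expression": "MOS = a - b * log(1 + rebuffering_ratio)",
    },
    "context": {"scenario": "HTTP adaptive streaming", "device": "mobile"},
    "evidence": "figure and section reference in the source paper",
}

serialized = json.dumps(record, indent=2)
parsed = json.loads(serialized)
```

Keeping the parameter definition, functional form, and supporting evidence in one record is what makes the dataset reusable for both directions of translation and for fine-tuning.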
One limitation of the study is the focus on video streaming, which may restrict the generalizability of the findings to other multimedia contexts. Additionally, while the dataset construction process is automated, the reliance on LLMs for extraction and evaluation may introduce biases or errors inherent to the models used. The paper also acknowledges that current LLMs struggle with complex reasoning tasks, which may affect the quality of translations in more nuanced scenarios.
The potential applications of the QoS-QoE Translation dataset are significant, as it can facilitate advancements in multimedia systems, particularly in real-time quality prediction and adaptive streaming. The dataset could serve as a foundational resource for developing AI agents that optimize user experience based on system conditions. Furthermore, the structured nature of the dataset allows for its use in retrieval-augmented systems, enhancing decision-making processes in network management.
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group, Chinese Academy of Sciences, Beijing Institute of Technology, Shanghai Jiao Tong University
The paper presents the AT-ADD challenge, a comprehensive evaluation framework for audio deepfake detection that addresses existing gaps in robustness and generalization across audio types. This work is significant as it lays the groundwork for advancing audio forensic technologies, promoting secure communication and reliable media verification in the face of growing synthetic audio threats.
The methodology presented in the paper is robust, proposing a structured evaluation framework for audio deepfake detection that includes two distinct tracks focusing on speech and all-type audio. The challenge is designed to address the limitations of existing benchmarks by incorporating real-world conditions and diverse audio types. The datasets are well-constructed, ensuring a comprehensive evaluation of the proposed countermeasures (CMs) under various conditions, which enhances the reliability of the results.
The experimental evaluation is thorough, with a clear description of dataset composition, including the number of samples and the diversity of audio types. The inclusion of multiple state-of-the-art generation methods for both real and fake audio in the evaluation sets allows for a rigorous assessment of the CMs' performance. Baseline models are provided, which facilitate fair comparisons and establish a strong foundation for future research.
The paper emphasizes reproducibility by providing official implementations of baseline models and clear rules regarding data usage. The closed setting for the challenge ensures that participants can only use the provided datasets, which minimizes variability and enhances the reliability of the results. However, the paper could benefit from more detailed implementation instructions or links to code repositories for the proposed methods.
One limitation of the proposed challenge is that it may not fully capture the complexity of real-world audio deepfake scenarios, especially in terms of environmental variability and user-generated content. Additionally, the focus on specific audio types may overlook other emerging forms of audio manipulation. The challenge's closed setting might also restrict innovative approaches that could leverage external data.
The AT-ADD challenge has significant implications for the field of audio forensics and security, as it aims to improve the robustness and generalizability of audio deepfake detection systems. By addressing the challenges associated with diverse audio types and real-world conditions, the challenge promotes the development of technologies that can enhance media verification and secure communication in an era of increasing synthetic audio generation.
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Hello Group Inc.
The main contribution of this paper is the introduction of CapTalk, a unified framework for voice design that effectively integrates single-utterance and dialogue generation, achieving state-of-the-art results while addressing key challenges in expressive speech synthesis. The comprehensive methodology and rigorous experimental evaluation position this work as a significant advancement in the field of machine learning and speech generation.
The paper introduces CapTalk, a unified caption-conditioned text-audio autoregressive framework that innovatively extends voice design to dialogue settings. The methodology effectively incorporates hierarchical variational conditioning to balance stable timbre preservation and context-adaptive expression, which is a significant advancement over existing methods that primarily focus on single-utterance generation. The use of CoT control sequences for explicit turn-level expressive control is a novel approach that enhances the model's ability to handle dynamic dialogue contexts.
The experiments demonstrate that CapTalk achieves state-of-the-art performance on single-utterance voice design benchmarks and shows improved expression controllability and contextual appropriateness in multi-turn dialogue. The evaluation protocol is comprehensive, utilizing both human evaluations and automatic metrics, which strengthens the reliability of the results. The paper provides detailed comparisons with existing models, showcasing the advantages of CapTalk through various metrics.
The paper outlines the architecture and training objectives clearly, which aids in reproducibility. However, the reliance on a specific multimodal model (Qwen3-Omni) for caption generation could limit the generalizability of the results if the model's performance varies. The authors plan to release caption annotations and a subset of data, which will further enhance reproducibility.
The paper acknowledges limitations related to the quality of the caption generation process and the emotional expressiveness of the training data, which primarily consists of natural conversational speech. These factors may impact the model's performance in more expressive settings. Additionally, the evaluation benchmarks for dialogue are still developing, which may affect the assessment of the model's capabilities.
CapTalk has the potential to significantly impact the fields of conversational AI and speech synthesis by enabling more natural and context-aware dialogue systems. The ability to generate expressive speech from textual descriptions could enhance applications in virtual assistants, gaming, and interactive storytelling, making human-computer interactions more engaging and realistic.
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speeches by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
Primary: Ulsan National Institute of Science and Technology (UNIST)
All Institutions: Ulsan National Institute of Science and Technology (UNIST)
The paper presents a novel approach to emotion editing in talking face videos through Cross-Modal Emotion Transfer (C-MET), significantly advancing the field by enabling the synthesis of extended emotions from audio inputs. The methodology is well-structured, and the experimental validation demonstrates its effectiveness, making it a valuable contribution to the machine learning community.
The proposed Cross-Modal Emotion Transfer (C-MET) method is innovative in its approach to emotion editing in talking face videos by leveraging emotion semantic vectors derived from audio and visual modalities. The methodology effectively addresses the limitations of existing methods by enabling the generation of extended emotions without requiring extensive labeled datasets. The use of a contrastive learning framework to align audio and visual representations is a notable strength, as it enhances the model's ability to generalize across different emotional expressions.
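The emotion-semantic-vector idea described above can be illustrated with a minimal sketch: take the emotion direction as the difference between an emotional and a neutral embedding, then apply that direction to an embedding in the other modality. All function names, shapes, and values here are illustrative assumptions, not C-MET's actual implementation.

```python
import numpy as np

def emotion_vector(emotional_emb: np.ndarray, neutral_emb: np.ndarray) -> np.ndarray:
    """Emotion semantic vector = emotional embedding minus neutral embedding."""
    return emotional_emb - neutral_emb

def transfer_emotion(target_neutral_emb: np.ndarray, emo_vec: np.ndarray,
                     strength: float = 1.0) -> np.ndarray:
    """Shift a (cross-modal) neutral embedding along the emotion direction."""
    return target_neutral_emb + strength * emo_vec

rng = np.random.default_rng(0)
neutral_audio = rng.normal(size=128)
happy_audio = neutral_audio + 0.5              # toy "happy" offset in audio space
vec = emotion_vector(happy_audio, neutral_audio)

neutral_face = rng.normal(size=128)            # visual-space embedding
happy_face = transfer_emotion(neutral_face, vec)
```

In this toy setup the transferred visual embedding differs from its neutral version exactly by the audio-derived emotion vector; the real method learns these vectors so that the difference is shared across modalities.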
The experiments conducted on the MEAD and CREMA-D datasets are comprehensive, demonstrating significant improvements in emotion accuracy over state-of-the-art methods. The quantitative metrics, such as \(Acc_{emo}\), alongside qualitative assessments from user studies, provide a robust evaluation of the model's performance. The results indicate that C-MET not only achieves higher accuracy but also maintains visual fidelity and synchronization, which are critical for practical applications.
The paper includes sufficient implementation details, such as the choice of encoders, training protocols, and loss functions, which facilitate reproducibility. The availability of code and demo links further supports this aspect, although the actual code repository is not provided in the text.
The model's reliance on a minimum number of speech samples for stable performance could limit its applicability in scenarios with limited data. Additionally, the current focus on English datasets restricts the model's generalizability to multilingual contexts. The inability to handle multi-view identity images is another notable limitation that could affect the model's robustness in diverse applications.
The ability to generate expressive talking face videos has significant implications for fields such as virtual reality, gaming, and telecommunication, where emotional engagement is crucial. The advancements in emotion editing can enhance human-computer interaction, making virtual agents more relatable and effective in applications like education and therapy. The potential for integrating this technology into various multimedia platforms could lead to more immersive and empathetic user experiences.
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. However, its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxa, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
Primary: Muséum National d'Histoire Naturelle
All Institutions: Muséum National d'Histoire Naturelle, Sebitoli Chimpanzee Project, Uganda Wildlife Authority, Nitidae Association, Centre d'Ecologie et des Sciences de la Conservation, Institut de Systématique, Evolution, Biodiversité
This paper presents DeepForestSound (DFS), a multi-species automatic detection model for passive acoustic monitoring in African tropical forests, demonstrating a significant advancement in biodiversity monitoring techniques. The innovative methodology, rigorous experimental evaluation, and potential for real-world applications underscore its importance in the field of machine learning and conservation biology.
The methodology presented in this paper is robust and innovative, utilizing a semi-supervised pipeline to generate labeled datasets from unannotated acoustic recordings. The combination of clustering techniques with manual validation, followed by fine-tuning a pretrained Audio Spectrogram Transformer (AST) using Low-Rank Adaptation (LoRA), is particularly noteworthy. This approach addresses the challenge of limited annotated data in biodiversity monitoring effectively. The detailed steps taken in data collection, processing, and model training demonstrate a comprehensive understanding of the complexities involved in acoustic monitoring in tropical environments.
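The Low-Rank Adaptation step mentioned above can be sketched in a few lines: the frozen pretrained weight W is augmented by a trainable low-rank product B A, with B zero-initialized so that training starts from the pretrained model. The shapes and alpha/r scaling below follow generic LoRA conventions and are not DFS's exact configuration.

```python
import numpy as np

d_out, d_in, r, alpha = 64, 32, 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight (not updated)
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B A x; only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)
```

Because B starts at zero, the adapted model initially matches the frozen backbone exactly, and the update B A never exceeds rank r, which is what keeps the number of trainable parameters small.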
The experiments are well-structured, with a clear evaluation protocol that includes comparisons with existing models such as BirdNET, Perch v2, and RDet. The results indicate that DFS outperforms these models for non-avian taxa, which is significant given the ecological importance of these species. The use of Average Precision (AP) and best F1 scores as evaluation metrics is appropriate, and the results are presented clearly, highlighting the model's strengths and weaknesses across different taxa.
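The Average Precision metric used in the evaluation above can be computed directly from ranked detection scores: sort hypotheses by score and average the precision at each true-positive rank. This is a standard AP implementation on synthetic values, not the authors' evaluation code.

```python
import numpy as np

def average_precision(scores, labels) -> float:
    """AP = mean of precision@k taken at the ranks of the positive examples."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # descending by score
    labels = np.asarray(labels)[order]
    cum_pos = np.cumsum(labels)
    precision_at_k = cum_pos / (np.arange(len(labels)) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())

# Toy example: positives ranked 1st and 3rd -> AP = (1 + 2/3) / 2
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```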
The paper provides sufficient detail on the implementation of the model, including the datasets used, preprocessing steps, and training configurations. However, the inability to share raw audio recordings due to legal restrictions may limit full reproducibility. The availability of the code and pretrained models on GitHub is a positive aspect that enhances reproducibility.
One limitation identified is the focus on a specific geographic region (Kibale National Park) and the potential lack of generalizability to other tropical forest ecosystems. Additionally, while the semi-supervised clustering approach is effective, the authors acknowledge that a systematic sensitivity analysis of hyperparameters was not conducted, which could affect the robustness of the model. The model's performance on underrepresented species may also be influenced by the limited training data available for those taxa.
The implications of this research are significant for biodiversity conservation, particularly in underrepresented and threatened species within African tropical forests. The development of a task-oriented model like DFS can facilitate more effective monitoring and conservation efforts, potentially leading to better-informed ecological management strategies. The framework's adaptability for future species integration also suggests a scalable approach to biodiversity assessment.
Conversational AI has made significant progress, yet generating expressive and controllable text-to-speech (TTS) remains challenging. Specifically, controlling fine-grained voice styles and emotions is notoriously difficult and typically requires massive amounts of heavily annotated training data. To overcome this data bottleneck, we present a scalable, data-efficient cascaded framework that pairs textual style tokens with human-curated, high-quality audio prompts. This approach enables single-shot adaptation to fine-grained speaking styles and character voices. In the context of TTS, this audio prompting acts as In-Context Learning (ICL), guiding the model's prosody and timbre without requiring massive parameter updates or large-scale retraining. To further enhance generation quality and mitigate hallucinations, we introduce a novel ICL-based online reinforcement learning (RL) strategy. This strategy directly optimizes the autoregressive prosody model using subjective aesthetic rewards while being constrained by Connectionist Temporal Classification (CTC) alignment to preserve intelligibility. Comprehensive human perception evaluations demonstrate significant improvements in both the naturalness and expressivity of the synthesized speech, establishing the efficacy of our ICL-based online RL approach.
Primary: Meta AI
All Institutions: Meta AI
The paper presents a novel cascaded framework for enhancing conversational TTS through ICL and online reinforcement learning, significantly improving the expressivity and naturalness of synthesized speech. The technical contributions, including innovative methodologies and thorough experimental evaluations, position this work as a meaningful advancement in the field of conversational AI and TTS systems.
The proposed methodology introduces a cascaded framework that utilizes textual style tokens and audio prompts for fine-grained control over TTS expressivity. The integration of In-Context Learning (ICL) allows for single-shot adaptation, which is a significant advancement in reducing the data requirements typically associated with expressive TTS systems. The novel ICL-based online reinforcement learning strategy optimizes the autoregressive prosody model using subjective aesthetic rewards while maintaining intelligibility through CTC alignment, showcasing a sophisticated approach to mitigating common issues in TTS systems.
The experiments are robust, employing comprehensive human perception evaluations that assess naturalness and expressivity across multiple dimensions. The use of a comparative Mean Opinion Score (CMOS) and a structured rating protocol based on paralinguistic dimensions adds rigor to the evaluation process. The results demonstrate substantial improvements over baseline models, indicating the effectiveness of the proposed methods.
The paper provides a clear description of the experimental setup, including the selection of audio prompts and the training process for the models. However, the lack of publicly available datasets or code may hinder full reproducibility. The authors could enhance reproducibility by providing access to their training data and model implementations.
One limitation is the reliance on human-curated audio prompts, which may introduce subjectivity and variability in the results. Additionally, while the proposed methods show improvements, the scalability of the approach in real-world applications and its performance across diverse languages and accents remain to be fully explored.
The advancements in expressive TTS have significant implications for various applications, including virtual assistants, audiobooks, and interactive entertainment. By enabling more natural and expressive speech synthesis, this research could enhance user experiences in conversational AI systems and contribute to the development of more engaging and human-like interactions.
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by upwards of half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using in-the-wild data, which we demonstrate via systems trained using naturally-noisy speech from VoxCeleb.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Carnegie Mellon University
The paper presents a significant advancement in the field of unsupervised speech separation by introducing innovative methodologies that effectively address the challenges posed by noisy training data. The combination of ring mixing and SCER loss represents a promising direction for future research, with the potential to improve the generalization of speech separation systems in real-world applications.
The paper introduces a novel batch construction strategy called "ring mixing" and an auxiliary loss function termed Signal-to-Consistency-Error Ratio (SCER). The methodology effectively addresses the limitations of conventional supervised training in noisy speech separation tasks by breaking the symmetry in the loss function that leads to undesirable optima. The use of multiple mixtures for the same source in training helps in reducing residual noise and improving the generalization of the model to real-world scenarios. The approach is well-justified, with a clear explanation of the problems with existing methods and a logical progression to the proposed solutions.
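A toy sketch makes the two ideas above concrete: ring mixing places each source in exactly two mixtures (source b appears in mixtures b and b-1, cyclically), and an SCER-style score rates how consistent the two resulting estimates of that source are. The exact loss definition in the paper may differ; the formula below is an illustrative assumption.

```python
import numpy as np

def ring_mixtures(sources: np.ndarray) -> np.ndarray:
    """sources: (B, T) -> mixtures: (B, T); mixture b = source b + source (b+1 mod B),
    so every source contributes to exactly two mixtures in the batch."""
    return sources + np.roll(sources, -1, axis=0)

def scer_db(est_a: np.ndarray, est_b: np.ndarray) -> float:
    """Signal-to-Consistency-Error Ratio (dB) between two estimates of one source:
    high when the estimates agree, low when they disagree."""
    ref = 0.5 * (est_a + est_b)
    err = est_a - est_b
    return 10.0 * np.log10((ref ** 2).sum() / ((err ** 2).sum() + 1e-12))

rng = np.random.default_rng(0)
srcs = rng.normal(size=(4, 16000))
mixes = ring_mixtures(srcs)                      # each source occurs in two mixtures
perfect = scer_db(srcs[0], srcs[0])              # identical estimates -> very high SCER
noisy = scer_db(srcs[0], srcs[0] + 0.1 * rng.normal(size=16000))
```

Maximizing such a consistency term penalizes any mixture-specific residual (e.g. retained background noise), since that residual differs between the two mixtures containing the same source.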
The experiments conducted on the WHAM! dataset demonstrate significant improvements in denoising capabilities, with results indicating a reduction in residual noise by upwards of half. The evaluation metrics, including SI-SDR and occupancy metrics, provide a comprehensive assessment of the model's performance. The results show that the proposed SCER loss contributes positively to the denoising task while maintaining separation quality, which is a critical aspect of the research.
The paper provides sufficient details regarding the datasets, model architecture, and training configurations, which are essential for reproducibility. However, the lack of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. The hyperparameter settings, particularly for the SCER loss, are mentioned but not extensively tuned, which could affect reproducibility in varying contexts.
One notable limitation is the observed degradation in performance when evaluating on noiseless conditions, suggesting that the model may not generalize well to all scenarios. Additionally, the reliance on specific datasets may limit the applicability of the findings to other types of noisy speech environments. The authors also mention that the SCER loss can lead to local minima, which may hinder optimal performance.
The proposed methods have significant implications for real-world applications in speech separation and denoising, particularly in environments where overlapping speech and background noise are prevalent. The ability to train models using naturally noisy recordings could enhance the robustness of speech processing systems in various applications, including telecommunications, hearing aids, and voice recognition systems. This work opens avenues for further research into unsupervised learning techniques in audio processing.
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, AISpeech Ltd, Nanjing University
The main contribution of this paper is the introduction of TASU2, a controllable CTC simulation framework that significantly improves the alignment and adaptation of speech LLMs in low-resource settings. The methodology and results presented demonstrate a meaningful advancement in the efficiency and effectiveness of speech recognition systems, particularly in the context of limited data availability.
The methodology proposed in TASU2 is innovative, focusing on controllable CTC simulation to improve the alignment between text and speech representations. The use of a WER-conditioned approach allows for more precise control over the generated posteriors, which is a significant advancement over previous methods like TASU. The authors effectively integrate a lightweight Transformer architecture to achieve this, which is appropriate for the task. The algorithm is well-structured, and the training signal is designed to closely mimic real acoustic behavior, enhancing the fidelity of the simulation.
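The WER-conditioned idea above can be illustrated with a toy transcript-corruption routine: tokens are substituted, deleted, or inserted at a specified target rate, yielding text-derived training targets at a chosen difficulty. TASU2 itself simulates CTC posterior distributions rather than discrete token edits, so this sketch only conveys the curriculum-control principle; all names below are hypothetical.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")

def corrupt(tokens: list, target_wer: float, rng: random.Random) -> list:
    """Apply substitution/deletion/insertion errors at roughly target_wer rate."""
    out = []
    for tok in tokens:
        if rng.random() < target_wer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(rng.choice(VOCAB))       # replace the token
            elif op == "ins":
                out.extend([tok, rng.choice(VOCAB)])  # keep token, insert an extra
            # "del": drop the token entirely
        else:
            out.append(tok)
    return out

rng = random.Random(0)
clean = list("thecatsat")
easy = corrupt(clean, 0.0, rng)    # zero target rate -> unchanged supervision
hard = corrupt(clean, 0.5, rng)    # harder curriculum stage
```

Sweeping `target_wer` from low to high is the kind of smoothly varying difficulty schedule the paper's curriculum design exploits.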
The experiments are comprehensive, evaluating TASU2 across various datasets and settings, including low-resource adaptation scenarios. The results demonstrate consistent improvements over the baseline methods, particularly in terms of WER reduction and domain generalization. The paper provides a thorough analysis of the results, including ablation studies that validate the importance of the WER conditioning. However, specific quantitative results (e.g., exact WER scores) were not detailed in the provided text, which could enhance clarity.
The paper outlines the training and evaluation setup, including the architecture of the simulator and the datasets used. However, the absence of a public code repository or detailed implementation instructions limits the reproducibility of the results. Providing a GitHub link or similar would significantly enhance this aspect.
One limitation is the reliance on a teacher ASR system for generating posteriors, which may introduce biases depending on the quality of the ASR model used. Additionally, while the method shows promise in low-resource settings, its performance in extremely low-resource scenarios remains to be fully explored. The paper could also benefit from a discussion on the scalability of the approach to larger datasets or more complex domains.
The proposed TASU2 framework has significant implications for the field of speech recognition, particularly in scenarios where paired audio-text data is scarce. By enabling effective low-resource adaptation, it opens avenues for deploying speech LLMs in diverse languages and dialects, thereby enhancing accessibility and usability in various applications. This could lead to advancements in real-time translation, voice assistants, and other speech-driven technologies.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
Primary: NIO
All Institutions: NIO
The paper presents a novel approach to entropy allocation in LLM-based ASR systems, significantly contributing to the understanding and improvement of model performance. The methodology is well-structured, and the experimental results validate the proposed framework, marking a meaningful advancement in the field of audio processing and machine learning.
The paper introduces an innovative perspective on entropy allocation in LLM-based ASR systems, proposing new metrics (NSE, PAI, CSAI) to analyze the dynamics between speech encoders and LLMs. The multi-stage training strategy, particularly the iterative asynchronous SFT (IA-SFT) stage, is a significant methodological advancement that aims to preserve functional decoupling and mitigate hallucinations. The approach is well-grounded in theoretical considerations and is supported by empirical evidence, making it a robust contribution to the field.
The experiments conducted on Mandarin and English benchmarks demonstrate the effectiveness of the proposed methods, achieving competitive performance with significantly fewer parameters than state-of-the-art models. The paper provides a comprehensive comparison with existing models, showcasing improvements in both recognition accuracy and hallucination rates. The use of diverse datasets strengthens the validity of the results.
The paper includes detailed descriptions of the training procedures, data statistics, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the findings.
While the proposed method shows promise, the paper does not address potential scalability issues when applied to larger datasets or more complex ASR tasks. Additionally, the reliance on specific metrics for evaluation may not capture all aspects of model performance, particularly in real-world scenarios.
The research has significant implications for the deployment of LLM-based ASR systems in real-world applications, particularly in enhancing recognition accuracy while reducing hallucinations. The findings could influence future research directions in ASR and multimodal systems, promoting more efficient and robust architectures.
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
Primary: Unknown
All Institutions: Unknown
The paper presents a novel Teacher-Guided Dual-Path framework for audio-visual representation learning, significantly improving state-of-the-art performance in zero-shot retrieval tasks. The comprehensive methodology and experimental validation highlight its potential impact on the field, addressing critical challenges in cross-modal alignment and semantic noise reduction.
The proposed TG-DP framework effectively decouples the objectives of masked reconstruction and contrastive learning into separate optimization paths. This dual-path approach allows for tailored visibility patterns that enhance cross-modal alignment while mitigating semantic noise and optimization interference. The introduction of a teacher-student mechanism further enriches the training process by providing structured guidance, which is a noteworthy advancement in the field. The methodology is well-structured and addresses existing challenges in audio-visual representation learning.
The experiments are comprehensive, utilizing large-scale datasets such as AudioSet-2M and VGGSound. The results demonstrate significant improvements in zero-shot retrieval performance, achieving state-of-the-art results across various metrics. The ablation studies provide valuable insights into the effectiveness of the proposed components, such as the dual-path structure and teacher-guided masking strategy. However, the paper could benefit from more detailed comparisons with additional baselines to further validate the claims.
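The zero-shot retrieval metric reported above, R@1, is simply the fraction of queries whose top-ranked candidate (by cross-modal similarity) is the paired ground truth. The similarity matrix below is synthetic, with row i's ground-truth match at column i; this is the standard metric, not the authors' evaluation code.

```python
import numpy as np

def recall_at_1(sim: np.ndarray) -> float:
    """sim[i, j] = similarity of query i (e.g. video) to candidate j (e.g. audio);
    ground-truth pairing is assumed to lie on the diagonal."""
    top1 = sim.argmax(axis=1)
    return float((top1 == np.arange(sim.shape[0])).mean())

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.3, 0.7, 0.4]])   # query 2 ranks the wrong candidate first
r1 = recall_at_1(sim)               # 2 of 3 queries retrieve their match -> 2/3
```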
The paper provides a clear description of the methodology and experimental setup, including hyperparameters and data preprocessing steps. The availability of code on GitHub enhances reproducibility. However, the lack of detailed information on the training environment and specific configurations may pose challenges for complete replication.
The primary limitation is the unknown primary institution and the lack of citation context, which may hinder the paper's visibility and impact in the academic community. Additionally, the performance improvements, while significant, may still be context-dependent and require further validation across diverse tasks and datasets.
The advancements in audio-visual representation learning have the potential to enhance various applications, including multimedia retrieval, content-based recommendation systems, and interactive AI systems. The proposed framework could lead to more robust models that understand and integrate audio-visual information, paving the way for future research and applications in multimodal AI.
Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target script block, computable without reference transcriptions, and report the first systematic measurement of script collapse across six languages spanning four writing systems (Pashto, Urdu, Hindi, Bengali, Malayalam, Somali) and nine ASR models on FLEURS test sets. Across 53 evaluated model-language pairs, 18 (34%; 95% Wilson CI: 23-47%) exhibit script collapse (SFR < 10%); MMS-1B and SeamlessM4T-v2 maintain SFR above 99% on every language evaluated, confirming that SFR correctly identifies high fidelity where it is present. We identify three distinct collapse patterns: Latin phonetic substitution (smaller Whisper on Indic languages), Arabic substitution for Somali's Latin-script orthography, and Devanagari substitution where larger Whisper models treat all Indic audio as Hindi, a failure present even in Whisper large-v3.
Primary: Independent Researcher
All Institutions: Independent Researcher
The paper presents a novel metric, Script Fidelity Rate (SFR), that effectively measures the fidelity of ASR outputs in multilingual contexts, addressing a critical gap in existing evaluation methodologies. The comprehensive analysis of technical contributions, methodology, and significance to the field underscores the potential for SFR to enhance the reliability of ASR systems across diverse languages and scripts.
The paper introduces a novel metric, Script Fidelity Rate (SFR), which addresses a critical gap in the evaluation of automatic speech recognition (ASR) systems, particularly for multilingual contexts. The methodology is well-defined, relying on Unicode block membership to assess the fidelity of output scripts without requiring reference transcriptions. This approach allows for a continuous evaluation in production settings, which is a significant advancement over traditional metrics like WER that do not account for script fidelity. The empirical taxonomy of collapse patterns adds depth to the analysis, providing insights into specific failure modes across different models.
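The Unicode-block check at the heart of SFR can be sketched directly: count the fraction of alphabetic hypothesis characters whose code points fall in the target script's block. The block ranges and character-filtering rule below are illustrative approximations; the paper's exact definition (e.g. handling of combining marks or multi-block scripts) may differ.

```python
# Approximate Unicode block ranges for a few target scripts (illustrative only)
BLOCKS = {
    "devanagari": (0x0900, 0x097F),
    "bengali":    (0x0980, 0x09FF),
    "arabic":     (0x0600, 0x06FF),
    "latin":      (0x0041, 0x024F),
}

def sfr(hypothesis: str, script: str) -> float:
    """Script Fidelity Rate: fraction of alphabetic characters in the target block.
    Needs no reference transcription, so it can run continuously in production."""
    lo, hi = BLOCKS[script]
    chars = [c for c in hypothesis if c.isalpha()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if lo <= ord(c) <= hi) / len(chars)
```

A collapsed hypothesis is then easy to flag: a Devanagari-target output like "namaste" scores 0.0, while genuine Devanagari text scores 1.0, matching the SFR < 10% collapse threshold used above.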
The experiments conducted are robust, evaluating 53 model-language pairs across six languages and nine ASR models. The use of the FLEURS test sets is appropriate, and the systematic measurement of SFR across various models highlights the effectiveness of the proposed metric. The results clearly demonstrate the prevalence of script collapse in certain models, particularly the Whisper family, and the paper effectively uses statistical confidence intervals to support its findings.
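The 95% Wilson interval quoted for the 18/53 collapse rate (23-47%) follows from the standard score-interval formula; this is textbook statistics rather than code from the paper:

```python
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion k/n at confidence
    level implied by z (z = 1.96 gives the usual 95% interval)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For k = 18, n = 53 this yields roughly (0.23, 0.47), matching the abstract.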
The paper provides a clear description of the experimental setup, including datasets and models used, and offers access to the code and results via Hugging Face. However, the lack of a peer-reviewed venue may raise concerns about the rigor of the validation process, although the author mentions validation against known positives and negatives.
The primary limitation is that SFR does not differentiate between high-quality target-script text and random characters from the correct script, which could lead to misleading interpretations. Additionally, the Unicode block specifications may be approximate, potentially affecting the accuracy of SFR for languages that use characters from multiple blocks.
The introduction of SFR has significant implications for the deployment of ASR systems in multilingual environments, particularly for low-resource languages. By enabling continuous monitoring of script fidelity, it can help developers identify and rectify issues before they affect end-users. This metric could foster improvements in ASR technologies, making them more reliable for diverse linguistic contexts.
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Shanghai Jiao Tong University, HKUST (GZ), Xidian University
The main contribution of this paper is the introduction of AudioKV, a novel framework for efficient KV cache management in audio-language models, which significantly enhances performance while reducing memory usage. This work addresses a critical bottleneck in deploying LALMs and offers a robust solution that combines innovative methodologies with thorough experimental validation, marking a meaningful advancement in the field of machine learning for audio processing.
The methodology presented in the paper is innovative, focusing on the unique challenges of Key-Value (KV) cache management in Large Audio-Language Models (LALMs). The authors propose a dual approach that combines audio-aware head allocation with Spectral Score Smoothing (SSS) to enhance the efficiency of KV cache usage. The identification of audio-critical attention heads through attention score analysis is a significant contribution, as it allows for a more nuanced allocation of resources compared to traditional uniform methods. The SSS technique, which employs FFT-based filtering to stabilize importance scores, is particularly noteworthy for its potential to improve performance in dynamic audio contexts.
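As a rough sketch of what FFT-based smoothing of a per-token importance-score sequence might look like (the cutoff heuristic and function name are assumptions; the paper's SSS may differ in detail):

```python
import numpy as np

def spectral_score_smoothing(scores: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """Low-pass filter a 1-D importance-score sequence in the frequency
    domain: keep only the lowest `keep_ratio` fraction of frequency bins,
    zero the rest, and invert the transform. High-frequency fluctuations
    in the scores are suppressed while the smooth global trend survives."""
    spectrum = np.fft.rfft(scores)
    cutoff = max(1, int(len(spectrum) * keep_ratio))
    spectrum[cutoff:] = 0.0  # discard high-frequency components
    return np.fft.irfft(spectrum, n=len(scores))
```

Token selection would then rank tokens by the smoothed scores instead of the raw, noisy ones.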
The experiments are comprehensive and demonstrate the effectiveness of AudioKV across multiple benchmarks, including Automatic Speech Recognition (ASR) and Speech Translation (ST). The results show that AudioKV outperforms existing methods significantly, especially at high compression ratios where other methods fail. The use of diverse datasets and models strengthens the validity of the findings, and the detailed performance metrics provide a clear picture of the advantages of the proposed method.
The paper mentions that the code will be released after acceptance, which is a positive step towards reproducibility. However, the absence of a public demo or project URL limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but the lack of a publicly available codebase at this time is a drawback.
One limitation noted in the paper is the potential for repetition and degeneration in output under high KV cache compression ratios, which could affect the quality of generated text. Additionally, while the method shows promise, its applicability to other modalities beyond audio is not explored, which may limit its generalizability.
The implications of this work are significant for the deployment of LALMs in real-world applications, particularly in resource-constrained environments where efficient memory usage is critical. The techniques developed could lead to advancements in speech recognition and multimodal interactions, potentially enhancing user experiences in various applications such as virtual assistants, transcription services, and interactive audio systems.
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, in which performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Nanjing University, Huawei Technologies Co., Ltd.
The main contribution of this paper is the introduction of EvoTSE, a novel framework for Target Speaker Extraction that dynamically updates speaker enrollments to mitigate speaker confusion and improve performance in challenging audio environments. This work significantly advances the state of the art in TSE, particularly in handling out-of-domain scenarios, and provides a solid foundation for future research in audio processing and speaker identification.
The proposed EvoTSE framework innovatively addresses the limitations of static enrollment in Target Speaker Extraction (TSE) by introducing a dynamic, evolving enrollment mechanism that utilizes historical context to adaptively update speaker cues. The methodology integrates a contextual retriever, backbone extractor, reliability classifier, and memory curator, which collectively enhance the robustness of speaker extraction in long-duration audio scenarios. The approach is well-structured and leverages existing concepts like Retrieval-Augmented Generation (RAG) while extending them into the audio domain, showcasing a thoughtful adaptation of techniques to solve a specific problem in TSE.
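The reliability-filtered enrollment update can be caricatured in a few lines. Everything below (class name, threshold, simple averaging) is a hypothetical stand-in for the paper's retriever/classifier/curator pipeline, meant only to convey the mechanism:

```python
import numpy as np

class EvolvingEnrollment:
    """Sketch of reliability-filtered enrollment updating: keep a memory of
    speaker embeddings from past extraction estimates, and refresh the
    enrollment cue from the high-confidence ones only."""

    def __init__(self, initial_embedding, threshold: float = 0.8):
        self.initial = np.asarray(initial_embedding, dtype=float)
        self.threshold = threshold
        self.memory = []

    def observe(self, embedding, confidence: float) -> None:
        # Reliability filter: discard low-confidence estimates so that
        # confusions with the interfering speaker do not pollute the cue.
        if confidence >= self.threshold:
            self.memory.append(np.asarray(embedding, dtype=float))

    def current_enrollment(self) -> np.ndarray:
        if not self.memory:
            return self.initial  # fall back to the pre-recorded enrollment
        # Blend the initial enrollment with the retrieved reliable history.
        return np.mean([self.initial] + self.memory, axis=0)
```

Because only high-confidence estimates enter the memory, a poor initial enrollment is gradually diluted rather than trusted forever.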
The experimental setup is comprehensive, utilizing multiple datasets including WSJ0-2mix, Libri2mix-clean, and a newly constructed Emotional Speech Database (ESD) to evaluate the model's performance across various conditions. The results demonstrate consistent improvements in extraction quality, particularly in out-of-domain scenarios, which is a significant contribution to the field. The use of multiple evaluation metrics, including SI-SDRi and NSR, provides a robust framework for assessing the model's effectiveness.
The paper provides sufficient implementation details, including model configurations and training strategies, which enhance reproducibility. However, the absence of a clear mention of the specific venue or publication may hinder broader accessibility to the research community. The availability of code and checkpoints on GitHub is a positive aspect that supports reproducibility.
One limitation is the reliance on the quality of historical estimates, which may introduce noise if the initial enrollment is poor. Additionally, while the framework shows promise in OOD scenarios, the paper does not extensively discuss the computational complexity or real-time applicability of the EvoTSE framework in practical applications.
The EvoTSE framework has significant implications for real-world applications such as voice assistants, automated transcription services, and any system requiring speaker identification in noisy environments. By improving the robustness of TSE, this work could enhance user experiences in various audio processing applications, particularly in dynamic and emotionally varied contexts.
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, preventing low-resource languages from reaching high-resource performance. To address this, we propose a semi-supervised framework based on Semantic-Emotional Resonance Embedding (SERE), a cross-lingual dynamic feature paradigm that requires neither target-language labels nor translation alignment. Specifically, SERE constructs an emotion-semantic structure using a small number of labeled samples. It learns human emotional experiences through an Instantaneous Resonance Field (IRF), enabling unlabeled samples to self-organize into this structure. This achieves semi-supervised semantic guidance and structural discovery. Additionally, we design a Triple-Resonance Interaction Chain (TRIC) loss to enable the model to reinforce the interaction and embedding capabilities between labeled and unlabeled samples during emotional highlights. Extensive experiments across multiple languages demonstrate the effectiveness of our method, requiring only 5-shot labeling in the source language.
Primary: Xinjiang University
All Institutions: Xinjiang University, Pengcheng Laboratory Xinjiang Network Node, Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center, Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing
The paper presents a semi-supervised framework for cross-lingual speech emotion recognition that effectively utilizes limited labeled data to improve performance across multiple languages. The technical contributions, particularly the novel use of dynamic feature extraction and interaction mechanisms, position this work as a meaningful advancement in the field of machine learning and emotion recognition.
The proposed methodology introduces a novel semi-supervised framework, Semantic-Emotional Resonance Embedding (SERE), which effectively addresses the challenges of cross-lingual speech emotion recognition (CLSER) by leveraging a small number of labeled samples to construct an emotion-semantic structure. The use of the Instantaneous Resonance Field (IRF) and the Triple-Resonance Interaction Chain (TRIC) loss is innovative, allowing for dynamic feature extraction and interaction between labeled and unlabeled data, which enhances the model's ability to generalize across languages.
The experiments are extensive, covering multiple languages and demonstrating the effectiveness of the proposed method with only 5-shot labeling. The results show significant improvements over existing methods, indicating the robustness of the approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and additional metrics to strengthen the evaluation.
While the methodology is described in detail, the lack of a publicly available code repository limits reproducibility. Including implementation details, hyperparameters, and data preprocessing steps would enhance reproducibility.
The paper acknowledges the challenge of emotional pronunciation differences across languages, which can lead to misclassification. Additionally, the reliance on a small number of labeled samples may limit the applicability of the method in more complex scenarios.
The proposed framework has significant implications for low-resource languages in emotional recognition tasks, potentially enhancing multilingual communication technologies and applications in areas such as mental health monitoring, customer service, and human-computer interaction.
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time-capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
Primary: University of California San Diego
All Institutions: University of California San Diego
The paper presents a comprehensive framework for real-time human-AI musical co-performance, utilizing latent diffusion models for generating instrumental accompaniment. The methodology effectively addresses the challenges of latency in generative models, and the results indicate strong potential for practical applications in live music settings.
The paper presents a novel framework for real-time human-AI musical co-performance utilizing latent diffusion models (LDMs) for generating instrumental accompaniment. The methodology is well-structured, combining a MAX/MSP front-end with a Python inference server, which is a significant step in bridging the gap between real-time audio processing and advanced AI models. The sliding-window look-ahead protocol is a clever approach to managing the inherent latency of generative models, allowing for continuous audio generation. The introduction of consistency distillation to reduce sampling time while maintaining audio quality is particularly innovative. However, the paper could benefit from a more detailed exploration of the implications of the look-ahead depth on musical coherence and generation quality.
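The sliding-window look-ahead protocol can be sketched independently of the diffusion model itself: the front end streams chunks, the server keeps the most recent `context_len` samples, and the model predicts the next `predict_len` samples from that context. Names and the buffering scheme below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def sliding_window_accompany(stream, context_len, predict_len, model):
    """Sketch of the sliding-window look-ahead protocol: consume a live
    audio stream chunk by chunk, and after each new chunk ask the model to
    predict `predict_len` future samples from the most recent `context_len`
    samples. `model` is any callable mapping a context array to a
    prediction array (a stand-in for the latent diffusion sampler)."""
    buffer = np.zeros(context_len)
    outputs = []
    for chunk in stream:
        chunk = np.asarray(chunk, dtype=float)
        # Slide the context window forward over the incoming audio.
        buffer = np.concatenate([buffer, chunk])[-context_len:]
        outputs.append(model(buffer)[:predict_len])
    return outputs
```

The real-time constraint is then that each `model` call must finish within the look-ahead horizon, which is what motivates the consistency-distilled sampler.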
The experimental setup is robust, utilizing the Slakh2100 dataset and a clear methodology for evaluating musical coherence, beat alignment, and audio quality. The results demonstrate strong performance across various configurations, showcasing the effectiveness of the proposed models in both retrospective and look-ahead regimes. The use of objective metrics such as COCOLA and Beat F1 scores provides a solid foundation for assessing the models' performance. However, the paper lacks a detailed comparison of subjective evaluations alongside the objective metrics, which would enhance the understanding of the models' performance from a listener's perspective.
The authors have made significant efforts to ensure reproducibility by providing access to the model code, pre-trained checkpoints, and detailed descriptions of the experimental setup. The inclusion of GitHub repositories and a demo page further aids in this regard. However, the paper could improve by providing clearer instructions on the setup process for users who may not be familiar with the technologies used, such as MAX/MSP and the specific configurations for the Python inference server.
One limitation of the study is the reliance on a specific dataset (Slakh2100), which may not fully represent the diversity of musical styles and contexts that the system could encounter in real-world applications. Additionally, while the look-ahead mechanism is innovative, it introduces a trade-off between latency and generation quality that may not be fully addressed in the current framework. The paper also does not explore the potential for user customization or adaptation of the system for different musical genres or performance contexts.
The proposed framework has significant implications for the field of music technology and AI, as it opens up new avenues for real-time collaboration between human musicians and AI systems. This could lead to enhanced creative possibilities in live performance settings, potentially transforming how music is created and experienced. The integration of AI into live performance also raises questions about authorship and the role of technology in artistic expression, which could spark further research and discussion in the field.
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode both time and frequency traits. Experiments are performed on a variety of speech and audio tasks, where we show that the ULTRAS framework achieves improved performance over other established baselines.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science
The main contribution of this paper is the introduction of the ULTRAS framework, which effectively integrates self-supervised learning techniques for joint modeling of audio and speech signals, showcasing significant improvements in performance across diverse tasks. This work represents a meaningful advancement in the field, addressing existing limitations in audio representation learning and providing a foundation for future research.
The proposed ULTRAS framework introduces a novel approach to self-supervised learning by integrating long-context masking and joint predictive modeling of both spectral and temporal targets. This methodology is a significant advancement over existing models, which typically focus on either temporal or spectral features separately. The use of transformer architecture to encode log-mel spectrograms, combined with a unique loss function that balances spectral and temporal predictions, showcases a well-thought-out design that addresses the limitations of previous models. The masking strategy, which operates over longer audio segments, is particularly innovative and is likely to enhance the model's ability to capture contextual information effectively.
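A minimal sketch of the kind of combined spectral/temporal objective described above, assuming mean profiles over the masked region as targets (the paper's exact target definitions and weighting may differ):

```python
import numpy as np

def combined_masked_loss(pred, target, mask, alpha=0.5):
    """Combined objective on masked patches of a (freq, time) log-mel
    spectrogram. `mask` is a boolean (freq, time) array marking the masked
    region. The temporal term averages over frequency first (per-frame
    profile), the spectral term over time (per-bin profile); both are
    illustrative stand-ins for the paper's exact targets."""
    err = (pred - target) * mask          # errors outside the mask are zeroed
    temporal = np.mean(np.mean(err, axis=0) ** 2)   # per-frame profile error
    spectral = np.mean(np.mean(err, axis=1) ** 2)   # per-bin profile error
    return alpha * spectral + (1 - alpha) * temporal
```

Weighting both terms forces the learned representations to be accurate along the time axis and the frequency axis simultaneously, which is the stated goal of the combined loss.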
The experiments conducted across a diverse set of speech and audio tasks demonstrate the robustness of the ULTRAS framework. The paper provides comprehensive evaluations using multiple datasets, including LibriSpeech and AudioSet, and compares the performance against established baselines. The results indicate that ULTRAS consistently outperforms these baselines, particularly in scenarios where both speech and audio tasks are involved. The inclusion of ablation studies further strengthens the findings by illustrating the contribution of each component of the proposed method.
The paper outlines the implementation details, including the pre-training and evaluation protocols, which are crucial for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to replicate the results independently. Clearer documentation or a supplementary repository would enhance reproducibility.
One limitation of the study is the reliance on a relatively small dataset for some experiments (200 hours), which may affect the generalizability of the results. Additionally, while the model shows improved performance, it is not clear how it scales with larger datasets or more complex tasks. The paper could also benefit from a more thorough discussion of potential biases in the datasets used.
The ULTRAS framework has the potential to significantly impact the fields of audio and speech processing by providing a unified approach that can be applied across various tasks. Its ability to learn robust representations from both speech and general audio signals could lead to advancements in applications such as automatic speech recognition, emotion recognition, and environmental sound classification. The implications of this work extend to improving the efficiency of training models in low-resource settings, thereby democratizing access to advanced audio processing technologies.
The human auditory system can selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as background noise or distortion, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. Our study proposes an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing both local and global context information processing in the CFTNet. Our experiments suggest that the DAT-CFTNet leads to consistently improved performance over existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model exhibits superior performance in enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F hearing restoration (e.g., >10%). CI listener studies in noisy settings show that the proposed solution is capable of suppressing non-stationary noise while avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.
Primary: Chittagong University of Engineering and Technology
All Institutions: Chittagong University of Engineering and Technology
The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
The proposed methodology introduces a novel dual-path attention mechanism integrated into a complex-valued frequency transformation network (CFTNet), which is a significant advancement in the field of speech enhancement, particularly for cochlear implant users. The combination of intra-chunk and inter-chunk RNNs with attention modules allows for enhanced modeling of speech and noise dynamics in time-frequency representations. The detailed architecture and the rationale behind the design choices are well articulated, showcasing a thoughtful approach to addressing the limitations of existing models.
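The dual-path pattern itself (independent of the attention and RNN details) reduces to segmenting the time axis into fixed-size chunks and alternating a local pass within each chunk with a global pass across chunks; a minimal sketch with placeholder callables standing in for the intra-/inter-chunk networks:

```python
import numpy as np

def dual_path_process(x, chunk_size, intra_fn, inter_fn):
    """Sketch of dual-path processing: segment a (time, feature) sequence
    into fixed-size chunks, apply `intra_fn` within each chunk (local
    modeling), then `inter_fn` across chunks at each within-chunk position
    (global modeling). The two callables are hypothetical stand-ins for
    the attention-augmented intra-/inter-chunk RNNs."""
    T, F = x.shape
    pad = (-T) % chunk_size
    x = np.pad(x, ((0, pad), (0, 0)))              # zero-pad to a whole chunk
    chunks = x.reshape(-1, chunk_size, F)          # (num_chunks, chunk, feat)
    chunks = np.stack([intra_fn(c) for c in chunks])
    # Inter-chunk pass: at each within-chunk index, process across chunks.
    chunks = np.stack([inter_fn(chunks[:, i]) for i in range(chunk_size)], axis=1)
    return chunks.reshape(-1, F)[:T]
```

Alternating the two passes is what lets the bottleneck layer combine local T-F detail with long-range context at sub-quadratic cost.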
The experiments are robust, employing a comprehensive dataset that includes various noise conditions and SNR levels. The evaluation metrics used (STOI, PESQ, SISDR) are appropriate for assessing speech intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and a discussion on the statistical significance of the results.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of a specific dataset and the architecture of the model, there are no code repositories or links to a demo that would allow other researchers to replicate the findings. Providing access to the model and training scripts would greatly enhance reproducibility.
One limitation is the reliance on objective metrics without a thorough subjective evaluation involving human listeners. While objective scores are important, subjective assessments are crucial for applications in speech enhancement, especially for cochlear implant users. Additionally, the model's complexity may limit its applicability in real-time scenarios, which is a critical factor for practical implementations.
The proposed DAT-CFTNet has the potential to significantly improve the quality of life for cochlear implant recipients by enhancing speech intelligibility in noisy environments. This advancement could lead to better communication and social interactions for individuals with hearing impairments. The public availability of the model also encourages further research and development in the field.