We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source three variants of AF-Next: AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
Primary: University of Maryland
All Institutions: University of Maryland, NVIDIA
The main contribution of this paper is the introduction of Audio Flamingo Next (AF-Next), a state-of-the-art open audio-language model that significantly advances audio understanding and reasoning capabilities, particularly for long and complex audio inputs. The comprehensive methodology, extensive experimental validation, and commitment to open science position this work as a significant milestone in the development of large audio-language models.
The methodology presented in the paper is robust, featuring a systematic analysis of previous models to identify gaps in audio understanding and reasoning. The introduction of the Temporal Audio Chain-of-Thought paradigm is particularly noteworthy, as it enhances the model's ability to handle long audio inputs by grounding reasoning steps to timestamps. The training strategy, which includes a four-stage curriculum and the curation of a large-scale dataset of over 1 million hours, demonstrates a comprehensive approach to improving model performance across various audio tasks. The use of diverse data sources and the focus on real-world applicability are commendable.
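To make the timestamp-grounded reasoning concrete, here is a minimal Python sketch of parsing such a trace. The bracketed [start-end] serialization, the function name parse_temporal_cot, and the sample trace are our illustrative assumptions; the paper's exact output format is not reproduced here.

```python
import re

# Hypothetical format: each reasoning step cites a [start-end] span in seconds,
# e.g. "[12.5-18.0] A siren rises in the left channel."
STEP = re.compile(r"\[(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\]\s*(.+)")

def parse_temporal_cot(text: str):
    """Extract (start_sec, end_sec, rationale) triples from a reasoning trace."""
    steps = []
    for line in text.strip().splitlines():
        m = STEP.match(line.strip())
        if m:
            steps.append((float(m.group(1)), float(m.group(2)), m.group(3)))
    return steps

trace = """[0.0-14.2] A train announcement plays over station noise.
[870.5-882.0] The same voice returns, now without the background crowd."""
for start, end, why in parse_temporal_cot(trace):
    print(f"{start:>7.1f}s - {end:>7.1f}s : {why}")
```

Grounding each step to a span in this way is what makes the intermediate reasoning auditable against the raw audio.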
The experiments conducted across 20 audio understanding and reasoning benchmarks are extensive and well-structured. The results show that AF-Next consistently outperforms previous models, including both open-weight and closed models, particularly in long-audio tasks. The paper provides a thorough comparison with state-of-the-art models, showcasing significant improvements in accuracy and robustness. The inclusion of qualitative examples further strengthens the evaluation of the model's capabilities.
The authors have committed to open-sourcing the model weights, training data, and code, which is a significant step towards ensuring reproducibility. However, the paper could benefit from more detailed descriptions of the training configurations and hyperparameters used in each stage, as well as clearer guidelines for replicating the experiments.
The paper acknowledges several limitations, including the challenges posed by noisy and unevenly distributed training data, particularly for low-resource languages and rare sound events. Additionally, while the model improves long-audio understanding, it still faces difficulties with temporally distant evidence. The evaluation focuses primarily on established benchmarks, which may not fully capture the model's capabilities in more complex scenarios.
The advancements presented in AF-Next have the potential to significantly enhance audio understanding applications, including automatic speech recognition, audio captioning, and music information retrieval. The model's ability to handle long-form audio and its open-source nature could foster further research and development in the field, promoting transparency and collaboration among researchers.
Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and producing dynamic orchestration across a full song. Existing open-source approaches typically make compromises among these goals. Continuous-latent generation models can capture long musical spans but often struggle to preserve fine-grained acoustic detail. In contrast, discrete autoregressive models retain local fidelity but suffer from unidirectional generation and error accumulation in extended contexts. We present LaDA-Band, an end-to-end framework that introduces Discrete Masked Diffusion to the V2A task: a global, non-autoregressive denoising formulation that combines the representational advantages of discrete audio codec tokens with full-sequence bidirectional context modeling. This design improves long-range structural consistency and temporal synchronization while preserving crisp acoustic details. Built on this formulation, LaDA-Band further introduces a dual-track prefix-conditioning architecture, an auxiliary replaced-token detection objective for weakly anchored accompaniment regions, and a two-stage progressive curriculum to scale Discrete Masked Diffusion to full-song vocal-to-accompaniment generation. Extensive experiments on both academic and real-world benchmarks show that LaDA-Band consistently improves acoustic authenticity, global coherence, and dynamic orchestration over existing baselines, while maintaining strong performance even without auxiliary reference audio. Code and audio samples are available at https://github.com/Duoluoluos/TME-LaDA-Band.
Primary: Institute of Computing Technology, Chinese Academy of Sciences (CAS)
All Institutions: Institute of Computing Technology, Chinese Academy of Sciences (CAS), Lyra Lab, Tencent Music Entertainment, Pengcheng Laboratory, State Key Lab of AI Safety
LaDA-Band presents a novel approach to vocal-to-accompaniment generation through Discrete Masked Diffusion, significantly improving upon existing methods in terms of acoustic authenticity, coherence, and orchestration. The comprehensive methodology and rigorous experimental validation position this work as a meaningful contribution to the field of machine learning in audio generation.
The methodology presented in LaDA-Band is innovative, leveraging Discrete Masked Diffusion to address the vocal-to-accompaniment generation problem. The dual-track prefix-conditioning architecture and the auxiliary replaced-token detection objective are significant contributions that enhance the model's ability to generate high-quality accompaniment while maintaining acoustic authenticity and global coherence. The two-stage progressive curriculum for training is a well-thought-out approach that allows the model to scale from short-form to full-song generation effectively.
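For readers unfamiliar with discrete masked diffusion, the toy sketch below shows the generic confidence-based iterative unmasking loop over codec tokens. The denoiser is a random stand-in and the unmasking schedule is illustrative, not LaDA-Band's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, T, MASK = 1024, 16, -1  # toy codec vocabulary, sequence length, mask id

def toy_denoiser(tokens):
    """Stand-in for the bidirectional transformer: per-position logits.
    A real model would condition on the vocal track; here it is random."""
    return rng.normal(size=(len(tokens), VOCAB))

def masked_diffusion_decode(steps=4):
    tokens = np.full(T, MASK)                         # start fully masked
    for s in range(steps, 0, -1):
        logits = toy_denoiser(tokens)
        conf = logits.max(axis=1)
        conf[tokens != MASK] = -np.inf                # keep committed tokens
        k = int(np.ceil((tokens == MASK).sum() / s))  # unmask k most confident
        for i in np.argsort(conf)[::-1][:k]:
            tokens[i] = int(logits[i].argmax())
    return tokens

print(masked_diffusion_decode())
```

Because every step re-reads the full sequence bidirectionally, errors do not accumulate left-to-right the way they do in autoregressive decoding.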
The experiments conducted are extensive, comparing LaDA-Band against a variety of state-of-the-art baselines across multiple metrics. The results demonstrate consistent improvements in acoustic authenticity, global coherence, and dynamic orchestration, particularly under zero-shot conditions. The use of both objective metrics (like FAD and Onset F1) and subjective evaluations (like MOS) provides a comprehensive assessment of the model's performance.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and evaluation metrics, which enhance reproducibility. The availability of the code and audio samples further supports this aspect, allowing other researchers to replicate the study.
While the paper acknowledges limitations such as dependency on the source separation and audio codec pipeline, it also notes challenges in fine-grained control over arrangement details and difficulties with certain stylistically free-form genres. These limitations suggest areas for future research and improvement.
The potential applications of LaDA-Band are significant, particularly in the music production industry, where automated accompaniment generation can streamline workflows for artists and producers. The framework's ability to generate high-quality music without extensive manual intervention could democratize music creation and enhance creative processes.
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Peking University, WeChat Vision, Tencent Inc
The main contribution of this paper is the introduction of Audio-Omni, a unified framework for audio understanding, generation, and editing, which leverages a novel architecture and a large-scale dataset to achieve state-of-the-art performance across multiple audio tasks. This work significantly advances the field of multimodal audio processing by providing a comprehensive solution that integrates various audio capabilities into a single model, setting a new standard for future research in generative audio intelligence.
The paper introduces Audio-Omni, a novel framework that integrates audio understanding, generation, and editing across diverse audio domains. Its architecture employs a frozen Multimodal Large Language Model (MLLM) for high-level reasoning and a trainable Diffusion Transformer (DiT) for synthesis, which is a significant advancement in unifying these tasks. The hybrid conditioning mechanism effectively separates high-level semantic inputs from low-level signal features, allowing for precise audio manipulation. The dataset construction method is also innovative, combining real-world data mining with synthetic data generation to create a large-scale dataset for instruction-guided audio editing.
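The frozen-reasoner/trainable-synthesizer split can be illustrated with a short PyTorch sketch. ToyLLM, ToyDiT, the bridge projection, and all dimensions are invented stand-ins, not the paper's modules.

```python
import torch
import torch.nn as nn

class ToyLLM(nn.Module):            # frozen multimodal reasoner (stand-in)
    def __init__(self, d=64):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        return self.proj(x)

class ToyDiT(nn.Module):            # trainable diffusion transformer (stand-in)
    def __init__(self, d=32):
        super().__init__()
        self.net = nn.Linear(d * 2, d)
    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

llm, dit = ToyLLM(), ToyDiT()
for p in llm.parameters():          # the high-level reasoner stays frozen
    p.requires_grad_(False)
bridge = nn.Linear(64, 32)          # trainable semantic-conditioning path

x = torch.randn(2, 64)              # pretend LLM hidden states
z = torch.randn(2, 32)              # noisy audio latents
pred = dit(z, bridge(llm(x)))       # only bridge + dit receive gradients
print(pred.shape, sum(p.requires_grad for p in llm.parameters()))
```

The appeal of this design is that the generative head can be trained at scale while the reasoning capabilities of the MLLM are inherited untouched.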
The experiments are extensive and demonstrate that Audio-Omni outperforms prior unified models and matches or exceeds the performance of specialized models across various benchmarks. The use of both objective metrics (like FAD and LSD) and subjective evaluations (human ratings) provides a comprehensive assessment of the model's capabilities. The results validate the effectiveness of the proposed architecture and its ability to generalize across multiple audio tasks.
The paper provides detailed implementation details, including architecture specifications, training protocols, and evaluation metrics, which enhances reproducibility. The authors commit to releasing their code, model, and dataset publicly, which is a positive step toward enabling other researchers to replicate their findings.
While the framework shows promise, it may still face challenges in handling highly complex audio editing tasks that require nuanced understanding beyond the current capabilities of the MLLM. Additionally, the reliance on a large-scale dataset may limit accessibility for researchers without similar resources.
The potential applications of this work are significant, ranging from creative audio generation to practical applications in media production and accessibility technologies. However, ethical considerations regarding the misuse of generative audio technologies, such as deepfakes, must be addressed.
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhancement module fine-tuned from WavLM via knowledge distillation on a large-scale supervised multi-distortion dataset. This module directly converts degraded waveforms into clean and linguistically faithful phonetic representations, ensuring robust enhancement with minimal linguistic hallucination. Based on these enhanced phonetic representations, an Adapter generates enhanced acoustic representations containing rich acoustic details, which a neural Vocoder uses to reconstruct corresponding high-fidelity 16 kHz waveforms. A PostNet then converts the waveforms to 48 kHz before resampling them to their original rates, enabling seamless handling of inputs and outputs at multiple sampling rates. Experimental results on several evaluation datasets, covering sub-tasks and full tasks, demonstrate that UniPASE achieves superior or competitive performance compared with existing state-of-the-art models. The proposed model also serves as the backbone of our submission to the URGENT 2026 Challenge, which achieved 1st place in the objective evaluation. The source code and audio demos are available at https://github.com/xiaobin-rong/unipase/.
Primary: Nanjing University
All Institutions: Nanjing University, Institute of Acoustics, NJU-Horizon Intelligent Audio Lab
The main contribution of this paper is the introduction of UniPASE, a generative model that effectively enhances speech across multiple distortions and sampling rates while minimizing hallucinations. This work significantly advances the field of universal speech enhancement by integrating innovative methodologies and demonstrating superior performance against existing state-of-the-art models.
The methodology presented in UniPASE is robust and innovative, extending the low-hallucination PASE framework to a universal speech enhancement context. The introduction of DeWavLM-Omni, which utilizes knowledge distillation for phonetic representation enhancement, is a significant advancement. The dual-stream approach, combining phonetic and acoustic representations, effectively addresses the challenges of linguistic and acoustic hallucinations. The explicit acoustic enhancement stage via an Adapter, along with the PostNet for flexible sampling rates, showcases a comprehensive design that addresses multiple distortions and enhances fidelity.
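The staged data flow described above can be summarized in a few lines. The function and argument names below are ours, and the components are identity placeholders rather than the actual modules.

```python
def enhance(waveform, orig_sr, dewavlm, adapter, vocoder, postnet, resample):
    x16 = resample(waveform, orig_sr, 16_000)  # model front-end runs at 16 kHz
    phonetic = dewavlm(x16)      # degraded audio -> clean phonetic representation
    acoustic = adapter(phonetic) # phonetic -> enhanced acoustic representation
    y16 = vocoder(acoustic)      # acoustic representation -> 16 kHz waveform
    y48 = postnet(y16)           # bandwidth extension to 48 kHz
    return resample(y48, 48_000, orig_sr)      # back to the original rate

# identity stand-ins just to show the data flow end to end
same = lambda x: x
out = enhance([0.0] * 80, 8_000, same, same, same, same,
              lambda w, sr_in, sr_out: w)
print(len(out))
```

The notable design choice is that linguistic content is fixed early (at the phonetic stage), so later stages can only add acoustic detail, not hallucinate words.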
The experiments are thorough, utilizing a diverse set of evaluation datasets that cover various speech enhancement tasks. The performance metrics reported, including DNSMOS, UTMOS, and speaker similarity, demonstrate that UniPASE achieves competitive results against state-of-the-art models. The model's first-place finish in the objective evaluation of the URGENT 2026 Challenge further validates its effectiveness. The comprehensive evaluation across different metrics and datasets indicates a rigorous approach to assessing the model's capabilities.
The paper provides detailed implementation details, including configurations for each module and the training setup. The availability of source code and audio demos on GitHub enhances reproducibility. However, the reliance on specific datasets and configurations may require careful attention from other researchers attempting to replicate the results.
While the paper presents a strong model, it may still face challenges in real-world applications where distortions are unpredictable. The performance under extreme noise conditions or in highly variable environments has not been extensively tested. Additionally, the model's complexity may pose challenges for deployment in resource-constrained settings.
The advancements in speech enhancement presented in this paper have significant implications for various applications, including telecommunications, virtual assistants, and accessibility technologies. By improving the fidelity and robustness of speech signals, UniPASE can enhance user experiences in noisy environments and contribute to more effective communication technologies.
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 200 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 3.68% WER on the LibriSpeech test-clean set at 200 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.20% on test-clean and 8.93% on test-other, corresponding to a 13% relative reduction while preserving perceptual quality.
Primary: Tsinghua University
All Institutions: Tsinghua University, Huawei Technologies Co., Ltd
ClariCodec presents a novel approach to neural speech coding by optimising for intelligibility at ultra-low bitrates using reinforcement learning. This work significantly advances the state of the art in speech codecs, addressing critical challenges in bandwidth-constrained communication environments while maintaining competitive performance metrics.
The methodology proposed in ClariCodec is innovative, particularly in its two-stage training approach that combines traditional reconstruction-based training with reinforcement learning (RL) for semantic optimisation. The reformulation of quantisation as a stochastic policy is a significant advancement, allowing for the direct optimisation of intelligibility using word error rate (WER) as a reward signal. This novel approach addresses the limitations of existing codecs that prioritize acoustic fidelity over intelligibility, making it a meaningful contribution to the field of neural speech coding.
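A generic REINFORCE step over a stochastic quantiser, of the kind the paper describes, might look like the toy sketch below. The reward stub substitutes for the real WER signal from an ASR system, and all shapes are illustrative.

```python
import torch

torch.manual_seed(0)
# 8 frames, 256-entry codebook; in the real codec these logits come from
# the encoder rather than being free parameters.
logits = torch.randn(8, 256, requires_grad=True)

def reward_from_wer(indices):
    return -0.01 * indices.float().mean()          # stub: -WER in reality

dist = torch.distributions.Categorical(logits=logits)
codes = dist.sample()                              # stochastic quantisation
r = reward_from_wer(codes)
loss = -(r.detach() * dist.log_prob(codes).sum())  # REINFORCE estimator
loss.backward()
print(codes[:4].tolist(), float(r))
```

Treating the codebook assignment as a sampled action is what lets a non-differentiable metric like WER supervise the encoder directly.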
The experimental evaluation is robust, utilizing the LibriSpeech dataset to benchmark performance against several existing neural speech codecs. The results demonstrate that ClariCodec achieves competitive performance at an unprecedented low bitrate of 200 bps, with a WER of 3.20% on test-clean and 8.93% on test-other. The paper includes comprehensive comparisons with baseline models, showing that ClariCodec maintains perceptual quality while achieving significant improvements in intelligibility through RL fine-tuning.
The paper provides detailed implementation information, including model architecture, training setup, and loss functions used in both stages of training. However, the lack of a publicly available code repository limits the reproducibility of the results. The authors mention using specific hardware and configurations, which could aid in reproducing the experiments if the code were available.
One limitation noted is the potential degradation in acoustic quality when optimising solely for intelligibility during the RL fine-tuning phase. The paper addresses this by incorporating a mel reconstruction loss to mitigate quality loss, but this trade-off remains a concern. Additionally, the non-causal architecture may introduce latency issues, which the authors plan to address in future work.
The implications of ClariCodec are significant, particularly for applications in bandwidth-constrained environments such as satellite and underwater communication. By prioritising intelligibility over acoustic fidelity, this codec could enhance communication reliability in critical scenarios. The potential for future developments, such as streaming codecs and integration with generative tasks, suggests a broad range of applications in speech technology.
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.
Primary: Xiaomi Inc.
All Institutions: Xiaomi Inc., Wuhan University
ControlFoley represents a substantial advancement in the field of video-to-audio generation, providing a unified framework that enhances controllability and robustness in multimodal audio synthesis. The combination of innovative methodologies, comprehensive experimental validation, and the introduction of a new evaluation benchmark positions this work as a significant contribution to the machine learning community.
The methodology presented in ControlFoley is robust and innovative, addressing key limitations in existing video-to-audio (V2A) generation systems. The joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder is a significant advancement, enhancing both audio-visual alignment and textual controllability. The introduction of temporal-timbre decoupling is particularly noteworthy, as it allows for precise stylistic control by suppressing redundant temporal cues while preserving essential timbre features. Additionally, the modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout is a clever approach to ensure the model's robustness across varying input conditions. The development of the VGGSound-TVC benchmark is also a critical contribution, filling a gap in the evaluation of textual controllability under visual-text conflicts.
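Random modality dropout of this kind is typically implemented as below; the probabilities, dictionary keys, and null-embedding handling are our assumptions rather than the paper's exact scheme.

```python
import random

def drop_modalities(cond, null_embed, p_text=0.3, p_ref=0.3):
    """With some probability, replace each optional condition (text,
    reference audio) with a learned "null" embedding so the model stays
    robust when a modality is missing at inference time."""
    out = dict(cond)
    if random.random() < p_text:
        out["text"] = null_embed["text"]
    if random.random() < p_ref:
        out["ref_audio"] = null_embed["ref_audio"]
    return out

random.seed(1)
cond = {"video": "v_feat", "text": "t_feat", "ref_audio": "a_feat"}
null = {"text": "t_null", "ref_audio": "a_null"}
print(drop_modalities(cond, null))
```

Exposing the model to every subset of conditions during training is also what makes the text-only, audio-only, and fully conditioned generation modes share one network.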
The experimental evaluation is comprehensive, demonstrating the effectiveness of ControlFoley across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. The authors provide extensive quantitative results, comparing their model against several state-of-the-art baselines. The use of diverse datasets for evaluation, including both in-distribution and out-of-distribution scenarios, strengthens the validity of their findings. The metrics employed, such as IB-score, CLAP-score, and DeSync, are appropriate for assessing the quality of generated audio and its alignment with visual content.
The paper includes sufficient details regarding the model architecture, training procedures, and evaluation metrics, which should facilitate reproducibility. The authors have also made their code, models, datasets, and demos available online, further supporting the reproducibility of their work.
While the paper presents a strong framework, it does not extensively discuss potential limitations or challenges in real-world applications, such as the model's performance in highly complex or noisy environments. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other contexts or types of audio-visual content.
The implications of this research are significant, particularly in fields such as film, gaming, and advertising, where high-quality audio generation is crucial. The ability to generate audio that is both synchronized with visual content and controllable via text or reference audio opens new avenues for creative expression and content creation. Furthermore, the introduction of a standardized benchmark for evaluating V2A systems may encourage further research and development in this area.
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable source for global soundscape generation, matching these views to real acoustic environments with unique spatial structures is inherently difficult. To address this challenge, we introduce Geo2Sound, a novel task and framework for generating geographically realistic soundscapes from satellite imagery. Specifically, Geo2Sound combines structural geospatial attributes modeling, semantic hypothesis expansion, and geo-acoustic alignment in a unified framework. A lightweight classifier summarizes overhead scenes into compact geographic attributes, multiple sound-oriented semantic hypotheses are used to generate diverse acoustically plausible candidates, and a geo-acoustic alignment module projects geographic attributes into the acoustic embedding space and selects the candidate most consistent with them. Moreover, we establish SatSound-Bench, the first benchmark comprising over 20k high-quality paired satellite images, text descriptions, and real-world audio recordings, collected from the field across more than 10 countries and complemented by three public datasets. Experiments show that Geo2Sound achieves a SOTA FAD of 1.765, outperforming the strongest baseline by 50.0%. Human evaluations further confirm substantial gains in both realism (26.5%) and semantic alignment, validating our high-fidelity synthesis at scale. Project page and source code: https://github.com/Blanketzzz/Geo2Sound
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou), University of South Carolina, University of Canterbury, Southwest Jiaotong University, Beijing University of Posts and Telecommunications
Geo2Sound presents a scalable framework for generating geographically aligned soundscapes from satellite imagery, addressing key challenges in the field of audio generation. The combination of innovative methodologies and comprehensive evaluations positions this work as a significant contribution to the advancement of multimodal audio systems.
The methodology presented in Geo2Sound is robust, integrating three key components—structural geospatial attributes modeling, semantic hypothesis expansion, and geo-acoustic alignment—into a cohesive framework. This approach effectively addresses the unique challenges posed by satellite imagery in soundscape generation. The use of a lightweight classifier for geographic attributes and the innovative semantic hypothesis expansion strategy significantly enhance the model's ability to produce diverse and contextually relevant soundscapes. The geo-acoustic alignment module further strengthens the framework by ensuring that the generated audio is not only acoustically plausible but also geographically consistent.
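The alignment step reduces to a projection followed by cosine ranking, as in this toy sketch; the projection matrix here is random and the dimensions are invented, so it only illustrates the selection mechanics.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 16))    # learned projection: geo -> acoustic space
geo = rng.normal(size=16)         # compact geographic attributes of the scene
cands = rng.normal(size=(5, 512)) # embeddings of 5 generated audio candidates

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = W @ geo
scores = [cos(query, c) for c in cands]
print("selected candidate:", int(np.argmax(scores)))
```

The division of labor is notable: diversity comes from the hypothesis-expansion stage, while geographic fidelity is enforced only at this final ranking step.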
The experiments are comprehensive, utilizing a well-constructed benchmark (SatSound-Bench) with over 20k paired satellite images, textual descriptions, and audio recordings. The results demonstrate significant improvements over existing baselines, with both objective metrics (e.g., FAD, CLAP scores) and human evaluations indicating superior performance in terms of realism and semantic alignment. The thoroughness of the evaluation, including ablation studies, provides strong evidence for the contributions of each component of the framework.
The paper provides detailed implementation specifics, including the architecture of the models used, the training process, and the datasets employed. However, the absence of a demo URL limits immediate reproducibility for external researchers. The authors have made the project code available on GitHub, which is a positive aspect for reproducibility.
One limitation is the reliance on satellite imagery, which may not capture all acoustic nuances present in ground-level scenes. Additionally, the model's performance may vary based on the quality and resolution of the satellite images used. The paper does not discuss potential biases in the dataset or the implications of using field recordings from specific geographic locations.
The potential applications of Geo2Sound are significant, particularly in urban planning, environmental monitoring, and immersive media. By enabling the generation of realistic soundscapes from satellite imagery, this framework could facilitate better understanding and management of urban environments and promote public engagement with environmental issues. The integration of such technology into digital twin cities and virtual reality experiences could revolutionize how we interact with and perceive our surroundings.
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, and reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by Auditory Scene Analysis, we first introduce a Perception-Aware Question Answering (PAQA) dataset. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning for training. Building on this, we propose HyPeR, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we finetune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage GRPO to refine the model's internal deliberation. We also introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases and design a perceptual consistency reward to align reasoning rationales with raw audio. Experiments across benchmarks demonstrate that HyPeR achieves absolute improvements over the base model, with performance comparable to large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning for robust and multi-speaker audio understanding.
Primary: Shanghai AI Laboratory
All Institutions: Shanghai AI Laboratory, Peking University, CUHK MMLab, Fudan University
The main contribution of this paper is the introduction of a hybrid reasoning framework (HyPeR) that effectively combines explicit perceptual reasoning with implicit latent computation for improved audio understanding. This work is significant as it addresses critical challenges in audio processing, such as perceptual errors and multi-speaker scenarios, while providing a structured dataset (PAQA) for training and evaluation.
The paper introduces a novel two-stage Hybrid Perception-Reasoning framework (HyPeR) that effectively integrates explicit perceptual reasoning with implicit latent computation. The use of the Perception-Aware Question Answering (PAQA) dataset is innovative, as it allows for a structured approach to audio understanding by decoupling speech from environmental sounds and handling multi-speaker scenarios. The introduction of PAUSE tokens to facilitate latent reasoning during ambiguous acoustic phases is a significant methodological advancement. The combination of supervised fine-tuning and reinforcement learning through Group Relative Policy Optimization (GRPO) is well-justified and effectively addresses the challenges posed by complex audio environments.
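A perceptual consistency reward of the sort described could be sketched as follows; the attribute keys, weights, and function name are our illustration, not the paper's exact formulation.

```python
def perceptual_reward(rationale_attrs, audio_attrs, answer_ok,
                      w_ans=1.0, w_pc=0.5):
    """Credit the rationale when the acoustic attributes it asserts match
    labels derived from the raw audio, on top of answer correctness."""
    keys = set(rationale_attrs) & set(audio_attrs)
    consistency = (sum(rationale_attrs[k] == audio_attrs[k] for k in keys)
                   / len(keys)) if keys else 0.0
    return w_ans * float(answer_ok) + w_pc * consistency

print(perceptual_reward({"speakers": 2, "noise": "traffic"},
                        {"speakers": 2, "noise": "rain"}, answer_ok=True))
# -> 1.25: correct answer plus half of the consistency bonus
```

Rewarding the rationale itself, not just the final answer, is what discourages the model from producing plausible-sounding but acoustically ungrounded explanations.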
The experiments are comprehensive, evaluating the proposed HyPeR framework against multiple benchmarks, including the newly introduced PAQA dataset. The results demonstrate substantial improvements in performance over baseline models, particularly in challenging scenarios involving background noise and multi-speaker interactions. The paper provides detailed quantitative metrics, which are essential for assessing the effectiveness of the proposed methods. However, the evaluation could benefit from more qualitative analysis of the model's outputs to better understand its reasoning capabilities.
The paper includes sufficient implementation details, including the architecture, training procedures, and hyperparameters used in the experiments. The availability of the code and dataset on GitHub enhances reproducibility. However, the paper could improve by providing clearer instructions on how to replicate the experiments, including any specific dependencies or configurations required.
The paper acknowledges several limitations, including the increased latency introduced by the PAUSE token mechanism and the potential for overthinking during reflection steps. While the authors note that their approach performs well on certain benchmarks, they also recognize that it may struggle with broader audio-language tasks. The PAQA dataset's limited scale and domain coverage are also mentioned as areas for future improvement.
The proposed methods have significant implications for audio understanding applications, particularly in areas such as speech recognition, multi-speaker dialogue systems, and environmental sound classification. By grounding reasoning in perceptual evidence, the framework could lead to more robust and interpretable audio processing systems. The work also highlights the importance of integrating perceptual and reasoning capabilities in machine learning models, which could influence future research directions in multimodal AI.
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets are available at: https://github.com/huanran-hu/MCSC.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Renmin University of China, Alibaba Group
The main contribution of this work is the introduction of a novel task and benchmark for multimodal context-to-script creation, which significantly enhances the evaluation and understanding of automated video production workflows. The comprehensive dataset and evaluation metrics established in this paper provide a valuable resource for advancing research in multimodal AI and video generation.
The methodology presented in this paper is robust and well-structured, introducing the Multimodal Context-to-Script Creation (MCSC) task, which effectively bridges the gap between noisy multimodal inputs and coherent video scripts. The authors provide a comprehensive dataset (MCSC-Bench) with over 11K annotated videos, which is a significant contribution to the field. The task's design emphasizes multimodal comprehension, narrative planning, and structured script generation, which are critical for realistic video production. The evaluation metrics are thoughtfully crafted to assess various dimensions of script quality, enhancing the reliability of the benchmarking process.
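To convey what a "structured, executable script" target might look like, here is an illustrative schema; the field names are our guesses at a plausible serialization, not the dataset's actual keys.

```python
# One hypothetical MCSC-Bench target: material-based shots reference input
# clips, planned shots carry shooting instructions, and each shot has an
# aligned voiceover line.
script = {
    "instruction": "make a 30-second travel recap",
    "shots": [
        {"type": "material", "source_clip": "clip_017.mp4",
         "voiceover": "We landed at dawn."},
        {"type": "planned", "shooting_instruction": "drone shot over the bay",
         "voiceover": "The city was still asleep."},
    ],
}
for shot in script["shots"]:
    print(shot["type"], "->", shot["voiceover"])
```

A schema like this is also what makes the three evaluation axes separable: material selection scores the source_clip choices, narrative planning scores the planned shots, and conditioned generation scores the voiceovers.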
The experimental evaluation is thorough, showcasing the performance of various state-of-the-art multimodal language models (MLLMs) on the MCSC-Bench dataset. The results indicate that existing models struggle with the complexities of long-context reasoning and structured planning, highlighting the benchmark's discriminative power. The experiments also validate the practical applicability of the generated scripts in downstream video generation tasks, demonstrating the utility of the proposed approach.
The paper provides detailed implementation and dataset construction protocols, which contribute to reproducibility. The authors outline the annotation process, model training, and evaluation strategies, ensuring that other researchers can replicate their findings. However, the lack of a publicly available demo or interactive tool limits immediate accessibility for practical applications.
One limitation is the reliance on specific MLLMs for evaluation, which may introduce biases based on the models' inherent capabilities. Additionally, while the dataset is extensive, it may not encompass the full diversity of real-world video production scenarios, potentially limiting the generalizability of the findings.
The proposed MCSC-Bench benchmark and the MCSC task have significant implications for the fields of automated video production and multimodal AI. By addressing the complexities of real-world video creation, this work could facilitate advancements in content generation for various applications, including advertising, education, and entertainment. The integration of structured script generation with multimodal inputs represents a promising direction for future research and development in AI-driven content creation.
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/
Primary: The Chinese University of Hong Kong, Shenzhen
All Institutions: The Chinese University of Hong Kong, Shenzhen
The main contribution of this paper is the introduction of VoxSafeBench, a benchmark that evaluates the safety, fairness, and privacy of speech language models in a comprehensive manner. This work significantly advances the understanding of how SLMs interact with audio context, revealing critical gaps that need to be addressed for responsible deployment in shared environments.
The paper introduces VoxSafeBench, a novel benchmark designed to evaluate speech language models (SLMs) across three critical dimensions: safety, fairness, and privacy, using a Two-Tier design. The methodology is robust, employing a comprehensive evaluation suite of 22 tasks that effectively distinguishes between content-centric risks and audio-conditioned risks. The inclusion of intermediate perception probes to validate the Tier 2 tasks is particularly noteworthy, as it demonstrates a thoughtful approach to isolating the effects of audio context on model behavior. The design choices are well-justified, and the tasks are relevant to real-world applications of SLMs in shared environments.
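The headline "speech grounding gap" comparison can be expressed as a tiny harness; the judge and model stand-ins and the item schema below are placeholders for the benchmark's real evaluation models and data format.

```python
def grounding_gap(items, model, judge):
    """Score the same items as text and as audio; the drop between the two
    pass rates is the speech grounding gap (positive = safeguards degrade
    when the cue arrives acoustically)."""
    text_ok = sum(judge(model(i["text"])) for i in items)
    audio_ok = sum(judge(model(i["audio"])) for i in items)
    n = len(items)
    return text_ok / n - audio_ok / n

items = [{"text": "t1", "audio": "a1"}, {"text": "t2", "audio": "a2"}]
model = lambda x: x                        # stand-in SLM
judge = lambda resp: resp.startswith("t")  # stand-in safety judge
print(grounding_gap(items, model, judge))  # 1.0 in this toy case
```

Pairing the perception probes with this gap is what lets the authors attribute failures to norm application rather than to cue detection.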
The experiments conducted are extensive and cover a wide range of scenarios that reflect the complexities of real-world interactions with SLMs. The results consistently reveal a significant gap in model performance when transitioning from text-based to audio-based inputs, highlighting the limitations of current SLMs in grounding their responses in acoustic context. The use of bilingual coverage (English and Chinese) adds depth to the evaluation, making the findings more generalizable across different language contexts. The statistical rigor applied in the analysis of results, including the use of reference upper bounds, strengthens the validity of the findings.
The paper provides a thorough account of the dataset construction, evaluation model selection, and metric definitions, which are essential for reproducing the results. The authors have made their code and data publicly available, which is a significant step towards ensuring reproducibility in the research community. The detailed descriptions of the experimental setup, including the prompts used for evaluation, further enhance the reproducibility of the study.
The authors acknowledge several limitations, including the reliance on synthesized audio rather than natural speech, which may not fully capture the nuances of real-world interactions. Additionally, the Tier 2 tasks utilize deliberately prominent cues, which may not reflect subtler cues encountered in practice. The text-only upper bounds may not represent true oracle performance, indicating potential gaps in the evaluation framework.
The implications of this work are significant, as it addresses critical issues related to the deployment of SLMs in socially sensitive contexts. By exposing the vulnerabilities of current models in recognizing and responding to audio-conditioned risks, the research paves the way for future developments in safer and more equitable AI systems. The benchmark established by VoxSafeBench can serve as a foundational tool for researchers and developers aiming to improve the social alignment of SLMs.
Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are often inconsistent and insufficient for complex tasks. To bridge this gap, we introduce Audio-Cogito, a fully open-source solution for deep audio reasoning. We develop Cogito-pipe for high-quality audio reasoning data curation, producing 545k reasoning samples that will be released after review. Based on this dataset, we adopt a self-distillation strategy for model fine-tuning. Experiments on the MMAR benchmark, the only audio benchmark evaluating the CoT process, show that our model achieves the best performance among open-source models and matches or surpasses certain closed-source models in specific metrics. Our approach also ranks among the top-tier systems in the Interspeech 2026 Audio Reasoning Challenge.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, China Telecom
The main contribution of this paper is the introduction of Audio-Cogito, an open-source framework for deep audio reasoning that leverages a novel data curation pipeline and self-distillation strategy, achieving state-of-the-art performance on audio reasoning benchmarks. This work significantly advances the capabilities of Large Audio Language Models (LALMs) and addresses critical gaps in the existing literature by providing high-quality datasets and methodologies for audio reasoning tasks.
The methodology presented in this paper is robust and well-structured, particularly with the introduction of the Cogito-pipe data curation pipeline. This four-stage pipeline effectively addresses the challenges of generating high-quality audio reasoning datasets, which have been a significant bottleneck in the field. The self-distillation strategy for model fine-tuning is innovative and aligns well with the objectives of enhancing reasoning capabilities. The paper also emphasizes the importance of quality verification, which is crucial for ensuring the reliability of the generated data. However, while the methodology is comprehensive, it could benefit from additional details on the implementation of the self-distillation process and the specific metrics used for quality verification.
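A standard rejection-sampling flavor of self-distillation, which appears consistent with the description here, can be sketched as follows; the generate/verify callables and the record format are our assumptions.

```python
def build_self_distill_set(questions, generate, verify, n_samples=4):
    """The model generates CoT candidates; traces whose final answers
    verify are kept as fine-tuning targets for the same model."""
    kept = []
    for q in questions:
        for _ in range(n_samples):
            trace, answer = generate(q)      # (reasoning text, final answer)
            if verify(q, answer):            # keep only verified traces
                kept.append({"prompt": q, "target": trace})
                break
    return kept

qs = ["what instrument enters at 0:12?"]
gen = lambda q: ("The attack is plucked... Answer: guitar", "guitar")
ver = lambda q, a: a == "guitar"
print(build_self_distill_set(qs, gen, ver))
```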
The experimental evaluation is thorough, utilizing the MMAR benchmark, which is a relevant and established framework for assessing audio reasoning models. The results demonstrate that Audio-Cogito achieves state-of-the-art performance among open-source models, which is a significant contribution. The comparison with both open-source and proprietary models provides a clear context for the effectiveness of the proposed approach. However, the paper could enhance its credibility by including more detailed statistical analyses of the results, such as confidence intervals or significance testing.
The paper mentions that the dataset will be released after review, which is a positive step towards reproducibility. However, the lack of detailed implementation specifics regarding the model architecture, training procedures, and hyperparameter settings may hinder full reproducibility. Providing access to the code and a clear description of the training environment would significantly improve this aspect.
One limitation of the study is the reliance on the MMAR benchmark, which, while relevant, may not encompass all aspects of audio reasoning. Additionally, the paper does not address potential biases in the dataset generated by Cogito-pipe, which could affect the generalizability of the results. The authors also do not discuss the computational resources required for training, which could be a barrier for some researchers in the field.
The potential applications of Audio-Cogito are significant, particularly in areas requiring deep audio reasoning, such as automated audio analysis, interactive audio systems, and enhanced accessibility tools for the hearing impaired. By providing an open-source solution, the authors contribute to democratizing access to advanced audio reasoning capabilities, which could spur further research and innovation in the field.
Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, Erasmus University Medical Center, Kyutai
The main contribution of this work is the introduction of TRIAGE, a tiered zero-shot framework that adaptively scales test-time computation for respiratory audio classification, significantly enhancing diagnostic performance while maintaining efficiency. This paper represents a meaningful advancement in the intersection of machine learning and medical diagnostics, offering a robust solution to the challenges posed by limited labeled data in healthcare applications.
The proposed TRIAGE framework introduces a novel three-tiered approach to zero-shot respiratory audio classification, which adaptively allocates computational resources based on the difficulty of the input. The methodology is well-structured, with clear delineation of each tier's function: Tier-L for initial scoring, Tier-M for descriptor-based matching, and Tier-H for retrieval-augmented reasoning using a large language model (LLM). This tiered approach is innovative as it addresses the challenge of uniform computation in medical audio classification, allowing for more efficient resource allocation and potentially improving diagnostic outcomes. The use of a confidence-based router to determine the tier progression is a significant methodological advancement.
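The router logic is straightforward to sketch; the thresholds, tier callables, and labels below are illustrative, while the three tiers mirror the Tier-L/M/H roles described in the paper.

```python
def triage(sample, tiers, thresholds):
    """Each tier returns (prediction, confidence); the sample exits at the
    first tier whose confidence clears its threshold."""
    for tier, tau in zip(tiers[:-1], thresholds):
        pred, conf = tier(sample)
        if conf >= tau:                 # confident enough: exit early, cheaply
            return pred
    pred, _ = tiers[-1](sample)         # hardest cases reach the LLM tier
    return pred

tier_l = lambda x: ("healthy", 0.92)    # label-cosine scoring
tier_m = lambda x: ("copd", 0.70)       # descriptor matching
tier_h = lambda x: ("copd", 1.00)       # retrieval-augmented LLM reasoning
print(triage("wav", [tier_l, tier_m, tier_h], thresholds=[0.85, 0.80]))
```

This toy sample exits at Tier-L; in the paper's setting, such early exits cover nearly half of all inputs, which is where the compute savings come from.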
The experiments conducted across nine respiratory classification tasks demonstrate the effectiveness of TRIAGE in a fully zero-shot setting, achieving a mean AUROC of 0.744, which surpasses prior zero-shot methods and matches or exceeds supervised baselines in many cases. The results are rigorously presented, with detailed comparisons against various baselines, including both zero-shot and supervised methods. The ablation studies further validate the contributions of each tier, providing insights into the performance gains achieved through adaptive computation.
The paper mentions that the source code will be made publicly available upon acceptance, which is a positive step towards reproducibility. However, the details regarding the implementation of the model and the exact configurations used in the experiments could be more explicitly outlined to enhance reproducibility. The use of public datasets is a plus, but the specifics of the data splits and any preprocessing steps should be clearly documented.
One limitation is the reliance on a frozen model, which may restrict the adaptability of TRIAGE to new tasks or datasets that differ significantly from those used during training. Additionally, while the framework shows promise in improving classification performance, the potential impact of noise and variability in real-world audio recordings has not been extensively addressed. The paper could also benefit from a discussion on the computational costs associated with each tier, particularly in clinical settings where resources may be limited.
The TRIAGE framework has significant implications for automated respiratory audio analysis, particularly in clinical settings where expert annotation is scarce. By improving the efficiency of zero-shot classification, this work could facilitate broader access to non-invasive disease screening tools, potentially leading to earlier detection and better patient outcomes. The methodology could also inspire further research into adaptive inference strategies in other domains of medical AI.
Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, triggering timbre and pronunciation degradation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory by executing acoustic style adapting, fine-grained visual calibrating, and time-aware context aligning. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism to simultaneously constrain frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves state-of-the-art performance across multiple metrics.
Primary: Institute of Computing Technology, Chinese Academy of Sciences
All Institutions: Institute of Computing Technology, Chinese Academy of Sciences, Fudan University, Hangzhou Dianzi University, Macquarie University, University of Chinese Academy of Sciences
The paper presents CoSync-DiT, a novel framework for movie dubbing that effectively synchronizes speech with lip movements while preserving vocal identity, demonstrating significant advancements over existing methods. The comprehensive methodology and robust experimental validation position this work as a meaningful contribution to the field of audio generation and multimodal learning.
The proposed methodology, CoSync-DiT, introduces a novel flow matching-based framework that effectively addresses the challenges of movie dubbing by leveraging a cognitive-inspired approach. The three-phase process of acoustic style adapting, fine-grained visual calibrating, and time-aware context aligning is well-structured and innovative, showcasing a clear departure from traditional methods that rely on explicit duration prediction. The introduction of the Joint Semantic and Alignment Regularization (JSAR) mechanism further enhances the robustness of the model, ensuring both temporal and semantic consistency. The methodology is sound and well-justified, with a clear rationale for each component's inclusion and its expected impact on dubbing quality.
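To illustrate how a flow-matching objective can be combined with JSAR-style regularization, the following is a minimal sketch assuming a DiT backbone that exposes its flow hidden states and contextual outputs; the loss weights, feature targets, and the three-output `model` interface are assumptions, not CoSync-DiT's actual design.

```python
# Sketch: conditional flow matching plus two consistency regularizers
# (frame-level temporal consistency and semantic consistency).
import torch
import torch.nn.functional as F

def training_step(model, x1, cond, sem_target, ctx_target, lam_t=0.1, lam_s=0.1):
    # x1: clean speech latents, shape (B, T, D); cond: dubbing conditions
    x0 = torch.randn_like(x1)                      # noise endpoint of the trajectory
    t = torch.rand(x1.shape[0], device=x1.device)  # random flow time per example
    xt = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_target = x1 - x0                             # conditional flow-matching velocity
    v_pred, hidden, ctx = model(xt, t, cond)       # velocity, flow hidden states, contextual outputs
    loss_fm = F.mse_loss(v_pred, v_target)         # flow-matching loss
    loss_tc = F.mse_loss(ctx, ctx_target)          # frame-level temporal consistency
    loss_sc = F.mse_loss(hidden, sem_target)       # semantic consistency on hidden states
    return loss_fm + lam_t * loss_tc + lam_s * loss_sc
```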
The experiments conducted are extensive and cover a variety of datasets, including both controlled and in-the-wild scenarios, which adds to the robustness of the evaluation. The use of multiple metrics, including pronunciation clarity, emotion similarity, and speaker similarity, provides a comprehensive assessment of the model's performance. The results demonstrate a clear superiority over state-of-the-art methods, validating the effectiveness of the proposed approach. However, the absence of human evaluations in the main results could be seen as a limitation in assessing the subjective quality of the generated dubbing.
The paper provides detailed implementation details, including model architecture specifications, training configurations, and evaluation metrics, which are essential for reproducibility. However, the lack of a public repository or code release limits the ability for others to replicate the results directly. The authors mention plans to open-source their work, which would greatly enhance reproducibility once available.
While the proposed method shows significant improvements in dubbing quality, the paper does not address potential limitations related to the generalizability of the model across diverse languages or accents. Additionally, the reliance on specific datasets may limit the applicability of the findings to broader contexts. The absence of qualitative assessments from human listeners is another notable limitation, as subjective evaluations are crucial in audio generation tasks.
The advancements in movie dubbing technology have significant implications for the film industry, media production, and personal content creation. By improving the quality and naturalness of synthesized speech, this research could enhance user engagement and accessibility in multimedia content. Furthermore, the cognitive-inspired approach may inspire future research in other areas of audio generation and multimodal learning.
Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely-used alignment approach based on time-axis distillation is optimal when considering more tasks. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on performance along three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance while allowing for a controllable balance.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Auditory Cognition and Computational Acoustics Lab, ByteDance Seed, MoE Key Lab of Artificial Intelligence
This paper makes a notable contribution by advancing the understanding of distillation loss functions in speech VAEs, presenting a novel approach that balances multiple performance metrics effectively. The comprehensive methodology and rigorous experimental evaluation underscore its significance in the field of audio processing and machine learning.
The paper presents a comprehensive exploration of various alignment approaches in the context of speech VAEs, specifically focusing on distillation loss functions. The introduction of joint-marginal alignment and adaptive weighting represents a significant methodological advancement, allowing for better balancing of reconstruction, understanding, and generation tasks. The systematic approach to evaluating different loss functions and their impact on downstream performance is well-structured and contributes to the clarity of the proposed methods.
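As a rough illustration of mixing frame-level ("joint") and distribution-level ("marginal") alignment terms with an adaptive weight, consider the sketch below. This is one plausible reading of the idea, not the paper's exact formulation; the sigmoid-based weighting rule in particular is a placeholder.

```python
# Sketch: VAE loss augmented with joint (frame-wise) and marginal (statistics)
# SSL-alignment terms, balanced by a crude adaptive weight.
import torch
import torch.nn.functional as F

def alignment_terms(z, ssl_feat):
    # z: VAE latents (B, T, D); ssl_feat: time-aligned SSL features (B, T, D)
    joint = 1 - F.cosine_similarity(z, ssl_feat, dim=-1).mean()   # frame-wise match
    marginal = F.mse_loss(z.mean(dim=1), ssl_feat.mean(dim=1)) \
             + F.mse_loss(z.std(dim=1), ssl_feat.std(dim=1))      # statistics match
    return joint, marginal

def total_loss(recon_loss, kl, z, ssl_feat, beta=1e-2):
    joint, marginal = alignment_terms(z, ssl_feat)
    # placeholder adaptive weighting: rely on alignment less as reconstruction improves
    w = torch.sigmoid(recon_loss.detach() - 1.0)
    return recon_loss + beta * kl + w * (joint + marginal)
```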
The experiments are extensive, covering a range of tasks that assess reconstruction, understanding, and generation capabilities. The use of multiple datasets, including LibriSpeech and SUPERB tasks, provides a robust evaluation framework. The results clearly demonstrate the advantages of the proposed JMAS-VAE over traditional methods, with detailed comparisons and statistical analyses that enhance the credibility of the findings.
The paper includes sufficient implementation details, including hyperparameters and training configurations, which facilitate reproducibility. The authors also provide a GitHub repository with models and code, further supporting the reproducibility of their results.
One limitation is the potential for overfitting due to the complexity of the models and the extensive number of training steps. Additionally, while the paper addresses multiple aspects of speech processing, it does not explore the implications of these methods in real-world applications or their scalability, which could be critical for practical deployment.
The findings have significant implications for the development of unified models in speech processing, potentially influencing future research in both speech generation and understanding. The integration of adaptive weighting and joint-marginal alignment could lead to more efficient and effective models in various applications, including speech recognition and synthesis technologies.
Room compensation aims to improve the accuracy of loudspeaker reproduction in reverberant environments. Traditional methods, however, are limited to improving only spectral (timbral) and temporal accuracy, neglecting the spatial accuracy of loudspeaker reproduction. We propose a method that compensates for both the spectral and spatial properties of loudspeaker reproduction by adding energy to the perceived reverberant sound field in a frequency-selective manner using a delayed secondary supporting source. This approach allows for the modification of the direct-to-reverberant ratio as a function of frequency, altering spatial and spectral reproduction. The proposed method is perceptually evaluated, demonstrating its ability to alter the perception of a primary loudspeaker without the listener perceiving the supporting source. The results show that the proposed method performs comparably to a well-established commercial room compensation algorithm and has several advantages over traditional room compensation methods.
Primary: Aalborg University
All Institutions: Aalborg University, B&O Research, Carl von Ossietzky Universität Oldenburg
The main contribution of this paper is the introduction of a novel room compensation method that utilizes a secondary loudspeaker to enhance both spectral and spatial accuracy in loudspeaker reproduction. This approach represents a significant advancement in audio processing techniques, addressing limitations of traditional methods and providing a foundation for future research in the field.
The proposed methodology introduces a novel approach to room compensation for loudspeaker reproduction by utilizing a secondary supporting loudspeaker to modify the perceived reverberant sound field. This method is innovative as it addresses both spectral and spatial inaccuracies, which are often neglected in traditional room compensation techniques. The use of the precedence effect to ensure that the supporting source is not perceived as an additional sound source is a clever integration of psychoacoustic principles into the design. The methodology is well-structured, with clear definitions and theoretical foundations, although it could benefit from more detailed descriptions of the implementation specifics.
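The core signal flow is simple enough to sketch: the primary signal is band-filtered, attenuated, and delayed within the precedence window before being sent to the supporting loudspeaker, raising reverberant energy (and lowering the direct-to-reverberant ratio) only in the selected band. The delay, gain, and band edges below are illustrative, not the paper's tuned values.

```python
# Sketch: derive the supporting-source feed from the primary signal.
import numpy as np
from scipy.signal import butter, sosfilt

def supporting_source(x, fs, delay_ms=15.0, band=(200.0, 2000.0), gain_db=-6.0):
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x) * 10 ** (gain_db / 20)       # frequency-selective energy
    d = int(round(delay_ms * 1e-3 * fs))             # delay within the precedence window
    return np.concatenate([np.zeros(d), y])[: len(x)]
```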
The experimental evaluation is robust, involving perceptual tests with human subjects to assess the effectiveness of the proposed method compared to traditional room compensation algorithms. The use of preference ratings and a variety of audio stimuli adds depth to the evaluation. However, the sample size is relatively small, which may limit the generalizability of the findings. The results indicate that the proposed method significantly improves listener preference compared to uncompensated playback, although it does not outperform a well-established commercial algorithm.
The paper lacks detailed implementation specifics, such as code or a clear description of the experimental setup that would allow for easy reproduction of the results. While the theoretical aspects are well-articulated, the practical application details are somewhat limited, which could hinder reproducibility.
One limitation of the study is the small number of participants in the perceptual evaluation, which may not adequately represent the broader population. Additionally, the proposed method's performance at higher frequencies is noted to be less effective compared to traditional methods, indicating potential areas for improvement. The reliance on psychoacoustic principles, while innovative, may also introduce variability in listener perception that is not fully accounted for.
The proposed method has significant implications for audio reproduction in various environments, particularly in home theater systems and professional audio setups. By improving the spatial and spectral accuracy of loudspeaker reproduction, this research could enhance the listening experience for consumers and professionals alike. Furthermore, it opens avenues for further research into integrating machine learning techniques for adaptive room compensation.
Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio language model designed for grounding audio events. SpotSound incorporates a novel training objective, specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than ~10% of each clip, creating a rigorous "needle-in-a-haystack" evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models and benchmark are released at https://loiesun.github.io/spotsound/.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Shanghai AI Laboratory, Zhejiang University
The paper introduces SpotSound, a novel framework for enhancing large audio-language models with precise temporal grounding capabilities, addressing critical limitations in existing approaches and providing a new benchmark for evaluation.
The methodology is robust, introducing a novel training objective that effectively suppresses hallucinations in temporal grounding tasks. The interleaving of timestamp tokens with audio tokens is a significant innovation that enhances temporal resolution. The two-stage problem formulation, separating event existence from temporal localization, is well-structured and addresses a critical gap in existing models. The synthetic dataset construction and the introduction of SpotSound-Bench as a benchmark are commendable contributions that enhance the paper's impact.
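The interleaving idea and the hallucination-suppressing target can be sketched as follows; the timestamp token format, frame rate, and the explicit `<absent>` answer are assumptions about one plausible realization, not SpotSound's actual token inventory.

```python
# Sketch: interleave timestamp anchor tokens with audio tokens, and map
# absent events to an explicit "not present" answer instead of an interval.
def interleave(audio_tokens, frames_per_sec=25, stamp_every_sec=1.0):
    step = int(frames_per_sec * stamp_every_sec)
    seq = []
    for i in range(0, len(audio_tokens), step):
        seq.append(f"<t={i / frames_per_sec:.1f}s>")   # timestamp anchor token
        seq.extend(audio_tokens[i:i + step])
    return seq

def grounding_target(event, spans):
    # Suppress hallucinated timestamps: an absent event yields <absent>,
    # never a fabricated interval.
    if not spans:
        return f"{event}: <absent>"
    return f"{event}: " + ", ".join(f"[{s:.1f}s-{e:.1f}s]" for s, e in spans)
```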
The experimental evaluation is comprehensive, demonstrating state-of-the-art performance across multiple benchmarks. The authors provide detailed comparisons with existing models, showcasing the effectiveness of their approach in both temporal grounding and sound event detection. The ablation studies further validate the contributions of various model components, enhancing the credibility of the results.
The paper includes sufficient implementation details, including model architectures, training strategies, and dataset construction methods, which should facilitate reproducibility. However, the absence of a public code repository or demo limits immediate accessibility for other researchers.
The model struggles with multi-instance scenarios where multiple occurrences of the same sound event are present, indicating potential limitations in its autoregressive decoding process. Additionally, the reliance on the quality of temporal annotations in the training data may affect generalization to more complex audio environments.
The advancements in temporal grounding have significant implications for real-world applications such as surveillance, media forensics, and interactive audio systems. By improving the ability of audio-language models to accurately localize events in complex auditory scenes, this work paves the way for more reliable audio understanding systems.
Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Audio samples are available at https://x-vc.github.io. Our code and checkpoints will also be released.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Fudan University, Shanghai Innovation Institute, Tianjin University, State Key Laboratory of Complex & Critical Software Environment
The paper presents X-VC, a zero-shot streaming voice conversion system that effectively integrates advanced methodologies to achieve high-quality, low-latency voice conversion. The technical contributions, particularly in conditioning frameworks and streaming inference, represent a meaningful advancement in the field of audio processing and voice synthesis.
The methodology presented in X-VC is innovative, leveraging a dual-conditioning acoustic converter that operates in the latent space of a pretrained neural codec. This approach allows for effective integration of both frame-level acoustic conditions and utterance-level speaker information, addressing the challenges of zero-shot voice conversion. The use of generated paired data and flexible role assignments during training is a notable contribution that enhances the robustness and effectiveness of the model. The chunkwise inference scheme with overlap smoothing is well-aligned with the codec's segment-based training, facilitating low-latency streaming.
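A minimal sketch of chunkwise inference with overlap smoothing is shown below: overlapping latent chunks are converted independently and crossfaded in the overlap region. The chunk and overlap sizes and the linear crossfade are illustrative choices, not X-VC's configuration.

```python
# Sketch: streaming conversion over overlapping latent chunks with a
# linear crossfade to smooth chunk boundaries.
import numpy as np

def stream_convert(latents, convert, chunk=48, overlap=8):
    T = latents.shape[0]
    out = np.zeros_like(latents)
    fade_in = np.linspace(0.0, 1.0, overlap)[:, None]
    pos = 0
    while pos < T:
        seg = convert(latents[pos : pos + chunk])      # one-step conversion per chunk
        end = pos + seg.shape[0]
        if pos > 0:                                    # crossfade with the previous chunk
            n = min(overlap, seg.shape[0])
            out[pos : pos + n] = (1 - fade_in[:n]) * out[pos : pos + n] + fade_in[:n] * seg[:n]
            out[pos + n : end] = seg[n:]
        else:
            out[pos:end] = seg
        pos += chunk - overlap                         # hop = chunk minus overlap
    return out
```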
The experiments conducted on the Seed-TTS-Eval benchmark demonstrate the effectiveness of X-VC in achieving superior performance in both streaming and offline settings. The paper provides comprehensive evaluations using both objective metrics (WER, SIM, UTMOS) and subjective assessments (SMOS), showcasing the model's ability to maintain high speaker similarity and content fidelity across different languages and settings. The results indicate that X-VC outperforms existing baselines, particularly in terms of efficiency and quality.
The paper outlines the implementation details, including model architecture, training strategies, and evaluation metrics, which are crucial for reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers. The authors mention that code and checkpoints will be released, which is a positive step towards facilitating reproducibility.
One limitation of the study is the reliance on a pretrained codec, which may limit the generalizability of the approach to other codec architectures. Additionally, while the model shows strong performance, the potential for further improvements in speaker similarity and naturalness remains an area for exploration. The evaluation is conducted on a specific dataset, which may not encompass all possible voice characteristics and accents.
The advancements in zero-shot voice conversion presented in this paper have significant implications for various applications, including dubbing, personalized speech generation, and assistive communication technologies. The ability to perform high-quality voice conversion in real-time opens up new possibilities for interactive systems and enhances user experience in multimedia applications.
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within contextual spoken dialogues through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally, the Mouth Agent delivers the script infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology
The paper presents ActorMind, a pioneering framework for speech role-playing that integrates emotional reasoning and contextual understanding through a multi-agent system. This work significantly advances the field of audio-based machine learning by bridging the gap between textual and auditory modalities in role-playing scenarios.
The methodology presented in this paper is innovative, introducing a multi-agent chain-of-thought reasoning framework (ActorMind) that emulates human actor performance in speech role-playing. The four agents (Eye, Ear, Brain, Mouth) are well-defined and contribute to a coherent process for generating emotionally nuanced speech. The hierarchical benchmark (ActorMindBench) is a significant contribution, providing a structured dataset that allows for comprehensive evaluation of speech role-playing capabilities. The design is grounded in established theatrical practices, enhancing its relevance and applicability.
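The four-agent chain can be sketched as a sequence of prompted calls; the prompts and the `llm`, `asr`, and `speech_lm` callables below are placeholders, not ActorMind's actual prompts or models.

```python
# Sketch: Eye -> Ear -> Brain -> Mouth as chained calls to underlying models.
def actor_mind(role_desc, dialogue_audio, scene, llm, asr, speech_lm):
    persona = llm(f"Summarize the verbal traits of this role:\n{role_desc}")    # Eye
    context = asr(dialogue_audio)
    cues = llm(f"Identify emotional cues in this spoken dialogue:\n{context}")  # Ear
    state = llm(f"Given persona {persona}, scene {scene}, and cues {cues}, "
                f"describe the speaker's current emotional state.")             # Brain
    return speech_lm(context=context, persona=persona, emotion=state)           # Mouth
```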
The experimental evaluation is robust, utilizing a well-structured dataset derived from a popular TV series, which ensures familiarity and relatability in the speech role-playing context. The use of subjective evaluation metrics (RP-MOS) adds credibility to the results, allowing for nuanced assessment of emotional expression and delivery accuracy. The paper reports clear performance improvements over baseline models, demonstrating the effectiveness of ActorMind in generating spontaneous and contextually appropriate speech.
The paper provides sufficient implementation details, including the construction pipeline for ActorMindBench and the operational mechanics of each agent in ActorMind. However, the lack of a publicly available demo or audio samples limits the immediate reproducibility of the results. Future work could benefit from sharing more implementation specifics or a demo to facilitate broader validation.
The primary limitation noted is the reliance on a single source (Friends Season 1) for the benchmark, which restricts the diversity of roles and contexts. This could limit the generalizability of the findings. Additionally, while the framework is off-the-shelf, further training could enhance its performance, particularly in more complex role-playing scenarios.
The work has significant implications for human-machine interaction, particularly in applications requiring emotionally intelligent responses, such as virtual assistants, gaming, and therapeutic settings. By advancing speech role-playing capabilities, it opens avenues for more engaging and realistic interactions between humans and machines, potentially transforming user experiences in various domains.
Multichannel speech enhancement is widely used as a front-end in microphone array processing systems. While most existing approaches produce a single enhanced signal, direction-preserving multiple-input multiple-output (MIMO) methods instead aim to provide enhanced multichannel signals that retain directional properties, enabling downstream applications such as beamforming, binaural rendering, and direction-of-arrival estimation. In this work, we propose a fully blind, direction-preserving MIMO speech enhancement method based on neural estimation of the spatial noise covariance matrix. A lightweight OnlineSpatialNet estimates a scale-normalized Cholesky factor of the frequency-domain noise covariance, which is combined with a direction-preserving MIMO Wiener filter to enhance speech while preserving the spatial characteristics of both target and residual noise. In contrast to prior approaches relying on oracle information or mask-based covariance estimation for single-output systems, the proposed method directly targets accurate multichannel covariance estimation with low computational complexity. Experimental results show improved speech enhancement, covariance estimation capability, and performance in downstream tasks over a mask-based baseline, approaching oracle performance with significantly fewer parameters and computational cost.
Primary: Chalmers University of Technology
All Institutions: Chalmers University of Technology
This paper presents a direction-preserving MIMO speech enhancement method utilizing a neural covariance estimator, which significantly advances the field by improving both computational efficiency and performance in multichannel audio applications. The innovative approach and thorough experimental validation position it as a valuable contribution to audio signal processing research.
The proposed methodology introduces a novel approach to MIMO speech enhancement by utilizing a neural network for covariance estimation, specifically through the OnlineSpatialNet architecture. This method effectively reduces the reliance on oracle information and mask-based techniques, which have been limitations in previous models. The integration of a direction-preserving MIMO Wiener filter enhances the robustness of the approach while maintaining spatial characteristics of the audio signals. The choice of a lightweight network architecture is commendable, as it balances performance with computational efficiency.
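The filtering side of the pipeline can be sketched with numpy: rebuild the noise covariance from the predicted Cholesky factor and apply a per-frequency multichannel Wiener filter. The network that predicts the factor is omitted, and the speech-covariance estimate and regularization below are simplifications of what a full system would use.

```python
# Sketch: direction-preserving MIMO Wiener filtering from an estimated
# lower-triangular noise-covariance factor.
import numpy as np

def mimo_wiener(Y, L_chol, eps=1e-6):
    # Y: (F, T, M) noisy multichannel STFT; L_chol: (F, M, M) noise Cholesky factor
    F_, T, M = Y.shape
    X_hat = np.empty_like(Y)
    for f in range(F_):
        Phi_n = L_chol[f] @ L_chol[f].conj().T              # noise covariance
        Phi_y = (Y[f].T @ Y[f].conj()) / T                  # noisy covariance
        Phi_s = Phi_y - Phi_n                               # crude speech covariance
        W = Phi_s @ np.linalg.inv(Phi_y + eps * np.eye(M))  # multichannel Wiener filter
        X_hat[f] = (W @ Y[f].T).T                           # enhanced multichannel frames
    return X_hat
```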
The experiments are well-structured, utilizing a comprehensive dataset generated from the DNS challenge and simulating realistic acoustic environments. The comparison against the NICE model provides a solid benchmark, and the reported metrics (SI-SDR, Cholesky loss, and covariance similarity) effectively demonstrate the advantages of the proposed method. The results indicate significant improvements in speech enhancement and covariance estimation, validating the effectiveness of the OnlineSpatialNet architecture.
The paper provides sufficient details regarding the experimental setup, including dataset generation, model configurations, and training procedures. However, the lack of a public code repository may hinder full reproducibility. The authors should consider releasing their code to facilitate further research and validation of their findings.
One identified limitation is the reliance on simulated data, which may not fully capture the complexities of real-world environments. Additionally, while the OnlineSpatialNet shows promising results, it may still struggle in highly reverberant or non-stationary noise conditions, which are common in practical applications. The paper could benefit from discussing these limitations more explicitly.
The proposed method has significant implications for various applications in audio processing, including hearing aids, telecommunication systems, and immersive audio experiences. By preserving directional information while enhancing speech quality, this research can contribute to advancements in spatial audio technologies and improve user experiences in noisy environments.
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs' EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.
Primary: Nanjing University
All Institutions: Nanjing University, Northwestern Polytechnical University, AISHELL
This paper introduces HumDial-EIBench, a novel benchmark for evaluating the emotional intelligence of audio language models using real human dialogues. The comprehensive methodology and significant findings regarding model performance gaps contribute meaningfully to the advancement of multimodal AI systems, highlighting the need for improved emotional understanding in AI interactions.
The proposed methodology is robust, leveraging real human dialogues to create a comprehensive benchmark for evaluating emotional intelligence in audio language models. The reformulation of tasks into multiple-choice questions with adversarial distractors is innovative and addresses the subjective biases present in previous benchmarks. The introduction of an acoustic-semantic conflict task is particularly noteworthy, as it evaluates models' abilities to handle contradictory multimodal signals, which is a significant gap in existing frameworks. The structured data construction pipeline ensures high-quality recordings and a controlled evaluation environment, enhancing the reliability of the results.
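A toy sketch of the multiple-choice protocol shows why it mitigates open-ended scoring bias: the model only has to select a letter, so correctness is unambiguous. The item fields and the `alm` callable are illustrative, not the benchmark's schema.

```python
# Sketch: score one MCQ item with shuffled options and adversarial distractors.
import random

def ask_mcq(alm, audio, question, answer, distractors, seed=0):
    opts = [answer] + list(distractors)
    random.Random(seed).shuffle(opts)
    letters = "ABCD"[: len(opts)]
    prompt = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, opts))
    pred = alm(audio, prompt).strip()[:1].upper()   # model answers with a letter
    return pred == letters[opts.index(answer)]
```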
The experiments conducted on eight state-of-the-art audio language models provide valuable insights into their performance across various emotional intelligence tasks. The results highlight critical deficiencies in current models, particularly in multi-turn emotional tracking and implicit causal reasoning. The use of both automated and human evaluations for different tasks adds depth to the analysis, although the reliance on LLMs for some scoring introduces variability. The findings are well-supported by quantitative metrics and qualitative assessments, making a strong case for the proposed benchmark's effectiveness.
The paper provides a clear description of the methodology and evaluation metrics, along with a link to the GitHub repository for accessing the benchmark. However, details on the specific implementations of the evaluated models and their configurations are limited, which may hinder full reproducibility. The authors could enhance this aspect by providing more granular information on the experimental setup and model parameters.
The study acknowledges limitations, such as the high variance in text empathy evaluation scores, indicating challenges in objectively quantifying empathy depth. Additionally, the acoustic-semantic conflict evaluation is currently limited to single-turn utterances, which may not fully capture the complexities of real-world interactions. Future work is needed to expand multi-turn conflict scenarios and improve automatic evaluation metrics.
The development of HumDial-EIBench has significant implications for the field of emotional intelligence in AI, particularly in enhancing the capabilities of audio language models. By addressing critical gaps in existing benchmarks, this work paves the way for more nuanced evaluations of multimodal systems, potentially leading to advancements in applications such as conversational agents, mental health support systems, and interactive entertainment.
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference's voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
Primary: Tsinghua University
All Institutions: Tsinghua University, The Chinese University of Hong Kong, Shenzhen
MimicLM presents a novel approach to voice imitation that leverages synthetic speech as training sources while retaining real recordings as targets, significantly enhancing the quality and naturalness of generated speech. The comprehensive evaluation and innovative methodology position this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology in MimicLM is innovative, particularly in its role-swapping data construction strategy which utilizes synthetic speech as sources while preserving real recordings as targets. This approach effectively addresses the scarcity of parallel data in voice imitation tasks and breaks the quality ceiling associated with synthetic targets. The incorporation of interleaved text-audio modeling enhances content fidelity, while preference alignment during post-training mitigates the distributional gap between training and inference. These methodological advancements are well-grounded in the challenges of voice imitation, making the approach both practical and theoretically sound.
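The role-swapping data construction can be sketched in a few lines: the transcript of a real recording is re-synthesized to serve as the source, so the real recording itself remains the training target. The `tts` and `asr` systems and the neutral source voice are placeholders.

```python
# Sketch: build a (source, reference, target) training triplet where the
# source is synthetic and the target is a real recording.
def build_pair(real_wav, reference_wav, tts, asr, neutral_voice="neutral_spk"):
    text = asr(real_wav)                           # content shared by source and target
    synth_src = tts(text, speaker=neutral_voice)   # synthetic speech as the SOURCE
    return {"source": synth_src,                   # model input
            "reference": reference_wav,            # voice/style to imitate
            "target": real_wav}                    # real recording as the target
```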
The experimental evaluation is comprehensive, utilizing both subjective and objective metrics to assess the performance of MimicLM against state-of-the-art systems. The use of a large-scale dataset (Emilia) for training and the systematic evaluation across multiple benchmarks (SeedTTS test-vc-en and MimicLM-Test) demonstrates the robustness of the results. The paper presents clear comparisons with existing methods, showing significant improvements in naturalness, intelligibility, and similarity metrics, which are crucial for voice imitation tasks.
The paper provides detailed implementation details, including training configurations, data construction processes, and evaluation metrics. However, the absence of a publicly available code repository limits reproducibility. While the methodology is described in depth, access to the actual implementation would enhance the ability of other researchers to replicate the results.
The paper acknowledges several limitations, including the dependency on the quality of the TTS model used for generating synthetic speech and the potential for higher word error rates (WER) on real inputs. Additionally, the reliance on external systems for TTS may introduce variability that affects the overall performance. The authors also highlight the risks associated with misuse of voice imitation technology, which necessitates careful consideration in deployment.
The advancements in voice imitation technology presented in this work have significant implications for applications in personalized voice assistants, audiobook narration, and accessibility tools. However, the potential for misuse, such as unauthorized voice cloning and impersonation, raises ethical concerns that must be addressed through appropriate safeguards and regulations. The authors emphasize the importance of responsible deployment and the need for ongoing dialogue within the research community regarding the ethical implications of their work.
Speech Large Language Models (Speech-LLMs) have recently made significant progress, greatly enhancing multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan data, coupled with the phonetic differences among its major dialects (Ü-Tsang, Amdo, and Kham), is a prime example of this challenge. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To efficiently align speech and text, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy. Experimental results demonstrate that Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation. Our work validates the effectiveness of cross-dialectal cooperation and provides a scalable paradigm for the development of Speech-LLMs in low-resource scenarios.
Primary: Minzu University of China
All Institutions: Minzu University of China, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan, which effectively addresses the challenges of data scarcity and dialectal diversity through innovative methodologies and comprehensive experimental validation. This work significantly advances the state-of-the-art in speech processing for low-resource languages, providing a scalable framework for future research and applications.
The paper introduces Ti-Audio, a novel end-to-end Speech-LLM specifically designed for Tibetan, which is a low-resource language. The methodology is innovative, employing a Dynamic Q-Former Adapter to bridge the gap between speech and text modalities effectively. The approach leverages cross-dialectal cooperation to enhance performance in resource-scarce settings, which is a significant advancement in the field of speech processing for low-resource languages. The use of a temperature-aware data balancing strategy is particularly noteworthy, as it addresses data imbalance issues effectively. Overall, the methodology is well-structured and presents a clear advancement over existing techniques.
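The temperature-based sampling strategy follows the standard multilingual recipe: sampling probability is proportional to corpus size raised to 1/T, so T > 1 flattens the distribution and up-samples smaller dialects. The corpus sizes and temperature below are illustrative.

```python
# Sketch: temperature-based sampling probabilities over dialect corpora.
import numpy as np

def dialect_probs(sizes, T=3.0):
    s = np.asarray(sizes, dtype=float)
    p = (s / s.sum()) ** (1.0 / T)   # flatten the size distribution for T > 1
    return p / p.sum()

# e.g. hours of Ü-Tsang, Amdo, and Kham data (illustrative numbers)
print(dialect_probs([500, 120, 60]))   # smaller dialects get boosted sampling mass
```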
The experiments are comprehensive, demonstrating the effectiveness of Ti-Audio across various tasks, including automatic speech recognition (ASR) and speech translation (ST). The results show significant improvements over baseline models, with state-of-the-art performance metrics reported. The experimental setup is robust, utilizing a well-constructed dataset that addresses the challenges of dialectal diversity and data scarcity. The paper also includes ablation studies that validate the contributions of different components of the architecture, enhancing the credibility of the results.
The paper provides detailed descriptions of the model architecture, training procedures, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or dataset limits the ability for others to replicate the results fully. The authors should consider releasing their code and data to facilitate further research in this area.
One limitation is the reliance on proprietary datasets, which may not be accessible to the broader research community. Additionally, while the model shows strong performance, the evaluation of emotional recognition tasks indicates that there are still challenges in modeling subtle emotional cues, suggesting areas for future improvement. The paper could also benefit from a more thorough exploration of the limitations of the proposed approach in terms of scalability and generalization to other low-resource languages.
The development of Ti-Audio has significant implications for the field of speech processing, particularly for low-resource languages. By demonstrating that cross-dialectal cooperation can enhance model performance, this work opens avenues for similar approaches in other dialectically diverse languages. The findings could lead to improved accessibility and usability of speech technologies for Tibetan speakers and potentially other low-resource language communities.
Audio tokenization has emerged as a critical component in end-to-end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding tasks due to single-modality constraints, particularly when audio signals contain ambiguous or incomplete information. While incorporating additional modality information can significantly enhance audio understanding, current multimodal fusion approaches invariably degrade reconstruction quality. This degradation is unacceptable for end-to-end audio systems that require high-fidelity audio generation capabilities. In this work, we investigate the root causes of reconstruction quality degradation in video-enhanced audio tokenization and present three key findings. First, the location of fusion within the tokenizer architecture is crucial for preserving reconstruction quality. Second, we show that contrastive learning, though effective in continuous representation fusion, is unsuitable for discrete tokenizers as it fails to enhance downstream task performance. Third, while feature-dimension fusion approaches achieve moderate success, we discover that fusing along the temporal axis -- guided by the concept of distinctive features -- yields significantly better results. Building on these insights, we introduce the Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization, the first approach to successfully integrate visual information into audio tokenizer architectures while preserving reconstruction fidelity. Our approach not only maintains high-fidelity reconstruction but also achieves superior performance on downstream understanding tasks compared with audio-only tokenizers and established multimodal fusion baselines.
Primary: University of New South Wales
All Institutions: University of New South Wales, Dolby Laboratories
This paper introduces a novel approach to multimodal audio tokenization that effectively addresses the challenges of integrating visual information while preserving audio reconstruction quality. The comprehensive methodology and rigorous experimental validation contribute significantly to the field of audio processing and multimodal learning.
The paper presents a novel approach, Timing-Aware Pre-Quantization Fusion (TAPF), which effectively integrates visual information into audio tokenization while preserving reconstruction fidelity. The methodology is well-structured, with clear hypotheses and systematic experimentation to validate the proposed fusion strategies. The dynamic temporal alignment mechanism is particularly innovative, allowing for adaptive focus on salient audio-visual events, which addresses limitations in conventional static fusion methods.
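One plausible realization of temporal-axis fusion, kept deliberately naive, is to insert time-aligned video frames into the audio feature sequence before the quantizer; the stride-based alignment below ignores the paper's distinctive-feature guidance and is an assumption, not TAPF's mechanism.

```python
# Sketch: fuse video frames along the temporal axis of the audio features
# before quantization (one video frame per `stride` audio frames).
import torch

def temporal_fuse(audio_feat, video_feat, stride=4):
    # audio_feat: (T_a, D); video_feat: (T_v, D), projected to the same dim
    chunks = []
    for i in range(0, audio_feat.shape[0], stride):
        chunks.append(audio_feat[i:i + stride])
        j = min(i // stride, video_feat.shape[0] - 1)
        chunks.append(video_feat[j:j + 1])     # insert the time-aligned video frame
    return torch.cat(chunks, dim=0)            # fused sequence fed to the quantizer
```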
The experiments are comprehensive, utilizing a well-defined evaluation framework that assesses both reconstruction quality and downstream understanding capabilities. The use of multiple metrics (Mel Error, STFT Distance, ViSQOL, SI-SDR, and AVQA Accuracy) provides a robust assessment of the proposed methods. The results convincingly demonstrate the advantages of TAPF over existing methods, particularly in terms of maintaining high fidelity while enhancing understanding performance.
The paper provides sufficient details regarding the experimental setup, including training data, model architecture, and evaluation protocols. However, the absence of a publicly available code repository or demo limits the reproducibility of the results, as external researchers may struggle to replicate the findings without access to the implementation.
One limitation is the lack of a direct comparison with other state-of-the-art multimodal fusion methods beyond the established baselines. Additionally, while the dynamic temporal alignment mechanism shows promise, its effectiveness may vary across different datasets and real-world scenarios, which warrants further investigation.
The proposed TAPF approach has significant implications for advancing audio understanding and generation tasks, particularly in applications where audio and visual information are naturally intertwined, such as in multimedia content creation, interactive systems, and assistive technologies. The findings could influence future research directions in multimodal learning and audio processing.
The rapid advancement of generative AI has made it increasingly challenging to distinguish between deepfake audio and authentic human speech. To overcome the limitations of passive detection methods, we propose StreamMark, a novel deep learning-based, semi-fragile audio watermarking system. StreamMark is designed to be robust against benign audio conversions that preserve semantic meaning (e.g., compression, noise) while remaining fragile to malicious, semantics-altering manipulations (e.g., voice conversion, speech editing). Our method introduces a complex-domain embedding technique within a unique Encoder-Distortion-Decoder architecture, trained explicitly to differentiate between these two classes of transformations. Comprehensive benchmarks demonstrate that StreamMark achieves high imperceptibility (SNR 24.16 dB, PESQ 4.20), is resilient to real-world distortions like Opus encoding, and exhibits principled fragility against a suite of deepfake attacks, with message recovery accuracy dropping to chance levels (~50%), while remaining robust to benign AI-based style transfers (ACC >98%).
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of StreamMark, a deep learning-based semi-fragile audio watermarking system designed for proactive deepfake detection, which represents a significant advancement in audio authentication methodologies. The comprehensive analysis highlights the innovative methodology, robust experimental validation, and potential applications in addressing the challenges posed by generative AI in audio content.
The proposed StreamMark methodology introduces a novel semi-fragile watermarking framework that adapts concepts from image forensics to audio, specifically targeting the challenges posed by deepfake audio. The architecture employs a complex-domain embedding technique that utilizes both real and imaginary components of the audio signal, enhancing imperceptibility. The dual-path distortion layer is particularly innovative, allowing the model to differentiate between benign and malicious transformations, which is a significant shift from traditional robustness-focused approaches. The training objective is well-defined, incorporating a composite loss function that balances imperceptibility, robustness, and fragility effectively.
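The composite training objective can be sketched as three terms: imperceptibility of the watermarked signal, message recovery after a benign transform, and decoder output pushed toward chance after a malicious one. The distortion functions, loss weights, and model interfaces are placeholders, not StreamMark's implementation.

```python
# Sketch: semi-fragile training step with a dual-path distortion layer.
import torch
import torch.nn.functional as F

def streammark_step(enc, dec, x, msg, benign, malicious, w=(1.0, 1.0, 0.5)):
    # msg: float tensor of 0/1 message bits
    xw = enc(x, msg)                                   # embed the watermark
    l_imp = F.l1_loss(xw, x)                           # imperceptibility
    l_rob = F.binary_cross_entropy_with_logits(dec(benign(xw)), msg)   # robust path
    p_mal = torch.sigmoid(dec(malicious(xw)))
    l_frag = F.mse_loss(p_mal, torch.full_like(p_mal, 0.5))            # push to chance
    return w[0] * l_imp + w[1] * l_rob + w[2] * l_frag
```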
The experimental evaluation is comprehensive, utilizing a custom deepfake benchmark to rigorously test the semi-fragility of the watermark. The results demonstrate high imperceptibility and robustness against benign transformations while confirming the fragility against malicious deepfake attacks. The metrics used (SNR, PESQ, ACC) are appropriate for the domain, and the benchmarks are well-structured, allowing for a clear comparison with existing state-of-the-art methods. However, the paper could benefit from more extensive comparisons with a broader range of existing techniques.
The paper provides sufficient details regarding the architecture, training process, and evaluation metrics, which should facilitate reproducibility. The use of standard datasets (Librispeech) and the open-sourcing of the deepfake benchmark further enhance the reproducibility of the results. However, the absence of specific implementation details (e.g., exact hyperparameters for all models) could pose challenges for complete replication.
One limitation of the approach is that while it addresses the fragility against deepfake manipulations, it may not account for all possible future adversarial techniques that could emerge. Additionally, the focus on audio may limit the applicability of the findings to other media types, such as video or images. The model's performance in real-world scenarios, particularly with diverse audio sources, remains to be fully validated.
The implications of StreamMark are significant, as it provides a proactive solution to the growing threat of deepfake audio, which has serious ramifications for trust in digital communications. By establishing a method for verifying audio authenticity, the framework could be integrated into various applications, including security protocols in enterprise communications, media verification, and regulatory compliance for AI-generated content. The open-source benchmark also contributes to the field by providing a resource for further research and development in audio watermarking.
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
Primary: Centro di Sonologia Computazionale (CSC)
All Institutions: Centro di Sonologia Computazionale (CSC), Department of Information Engineering, University of Padua, Boston University
The main contribution of this paper is the introduction of BMdataset and LilyBERT, which together provide a robust framework for symbolic music representation learning, demonstrating that expert-curated datasets can outperform larger, noisier datasets in music classification tasks. This work significantly advances the field by addressing a gap in the use of text-based music formats and establishing a new baseline for future research.
The paper presents a novel approach to music representation learning by introducing BMdataset, a carefully curated dataset of LilyPond scores, and LilyBERT, a CodeBERT-based model specifically adapted for symbolic music. The methodology includes a unique tokenizer that preserves musical semantics by treating LilyPond-specific commands as atomic units. The two-stage training process, combining broad pre-training on a large corpus with domain-specific fine-tuning, is well-justified and effectively demonstrated through rigorous experiments.
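As an illustration of the vocabulary-extension step, the sketch below adds a handful of LilyPond commands to a CodeBERT tokenizer using the Hugging Face `transformers` API; the token list is a small illustrative sample, not the paper's full set of 115 tokens.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

# Sample LilyPond commands treated as atomic tokens (illustrative subset).
lilypond_tokens = ["\\relative", "\\time", "\\key", "\\clef", "\\score"]
num_added = tokenizer.add_tokens(lilypond_tokens)

# Newly added rows of the embedding matrix start randomly initialized and
# are learned during masked language model pre-training on the score corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

Treating each command as one token prevents the subword tokenizer from splitting musically meaningful directives such as `\key` into fragments that carry no musical semantics.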
The experiments conducted are comprehensive, utilizing linear probing to assess the effectiveness of the proposed model. The results indicate that the curated dataset significantly outperforms larger, less curated datasets for composer and style classification tasks. The systematic evaluation on the Mutopia corpus provides a solid benchmark for future research, and the findings are statistically significant, showcasing the advantages of expert curation in training datasets.
The authors have made their dataset, model weights, and code publicly available, which enhances reproducibility. The detailed descriptions of the dataset creation process, model architecture, and training procedures allow other researchers to replicate the study. However, the paper could benefit from clearer documentation of the training environment and hyperparameter settings.
The dataset is skewed towards certain composers, particularly Vivaldi, which may limit its generalizability. Additionally, the reliance on automatically converted data from the PDMX corpus for pre-training may introduce artifacts that could affect the model's performance. The authors acknowledge these limitations and suggest future work to expand the dataset and explore more robust model architectures.
This work has significant implications for the field of music information retrieval and generative AI, particularly in enhancing the understanding and generation of symbolic music. The introduction of a domain-specific model like LilyBERT could pave the way for more nuanced applications in music analysis, composition, and education, fostering greater engagement with less-represented composers in the Baroque period.
Deep learning models have improved sign language-to-text translation and made it easier for non-signers to understand signed messages. When the goal is spoken communication, a naive approach is to convert signed messages into text and then synthesize speech via Text-to-Speech (TTS). However, this two-stage pipeline inevitably treats text as a bottleneck representation, causing the loss of rich non-verbal information originally conveyed in the signing. To address this limitation, we propose a novel task, \emph{Sign-to-Speech Prosody Transfer}, which aims to capture the global prosodic nuances expressed in sign language and directly integrate them into synthesized speech. A major challenge is that aligning sign and speech requires expert knowledge, making annotation extremely costly and preventing the construction of large parallel corpora. To overcome this, we introduce \emph{SignRecGAN}, a scalable training framework that leverages unimodal datasets without cross-modal annotations through adversarial learning and reconstruction losses. Furthermore, we propose \emph{S2PFormer}, a new model architecture that preserves the expressive power of existing TTS models while enabling the injection of sign-derived prosody into the synthesized speech. Extensive experiments demonstrate that the proposed method can synthesize speech that faithfully reflects the emotional content of sign language, thereby opening new possibilities for more natural sign language communication. Our code will be available upon acceptance.
Primary: Keio University
All Institutions: Keio University
The paper presents a significant advancement in sign language processing through the introduction of a novel task and a robust methodology that effectively captures prosodic nuances in synthesized speech. The combination of adversarial learning and reconstruction losses represents a meaningful contribution to the field, with potential applications that could greatly enhance communication for the hearing impaired.
The paper introduces a novel task of Sign-to-Speech Prosody Transfer, which is a significant advancement in the field of multimodal learning. The methodology employs a GAN-based framework (SignRecGAN) that utilizes unpaired unimodal datasets, thus addressing the challenge of obtaining aligned datasets for sign and speech. The architecture (S2PFormer) effectively integrates sign-derived prosody into synthesized speech, maintaining the expressiveness of TTS models. The use of adversarial learning combined with reconstruction losses (SignRec loss and ProMo loss) is innovative, ensuring that the synthesized speech retains the nuances of sign language. However, the paper could benefit from a more detailed exploration of the limitations of the proposed losses and their impact on the final output.
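The following sketch indicates how an adversarial term and a prosody-reconstruction term might be combined in a single generator update when no paired sign-speech data is available. All module names and the weighting are hypothetical; this is not the paper's SignRec or ProMo loss.

```python
import torch
import torch.nn.functional as F

def generator_step(sign_encoder, speech_prosody_encoder, tts_decoder,
                   discriminator, sign_features, lambda_rec=10.0):
    # Map the sign input to a global prosody embedding, then synthesize speech.
    prosody = sign_encoder(sign_features)
    speech = tts_decoder(prosody)

    # Adversarial term: synthesized speech should be judged as real speech
    # drawn from the unimodal speech corpus.
    logits = discriminator(speech)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # Reconstruction term: prosody re-estimated from the synthesized speech
    # should match the prosody derived from the sign input.
    rec = F.l1_loss(speech_prosody_encoder(speech), prosody)
    return adv + lambda_rec * rec
```

The reconstruction term is what ties the two unpaired modalities together: without it, the adversarial term alone would let the generator produce realistic speech that ignores the sign-derived prosody.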
The experimental setup is robust, utilizing both qualitative and quantitative evaluations, including user studies and objective metrics like WER and UTMOS. The paper reports significant findings that demonstrate the effectiveness of the proposed method in capturing emotional nuances in synthesized speech compared to traditional two-stage methods. The ablation studies provide insights into the contributions of each component, reinforcing the importance of the proposed losses. However, the paper lacks a comprehensive comparison with other state-of-the-art methods in the same domain, which could strengthen its claims.
The paper provides sufficient details on the datasets used, preprocessing steps, and the architecture of the model. However, the absence of a publicly available code repository at the time of review limits reproducibility. The authors mention that the code will be available upon acceptance, which is a positive aspect but should ideally be accessible during the review process.
One limitation is the reliance on unimodal datasets, which may not fully capture the complexities of sign language prosody. Additionally, the subjective evaluation metrics, while valuable, may introduce bias depending on the participants' familiarity with sign language. The paper also does not address the potential challenges in scaling the model to different sign languages or dialects.
The proposed method has significant implications for improving communication for individuals with hearing impairments, potentially enhancing the expressiveness and naturalness of synthesized speech in sign language applications. This could lead to better integration of sign language users in various contexts, including education and social interactions. The approach also opens avenues for further research in multimodal learning and prosody transfer, which could benefit other areas of machine learning.
Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while keeping architectures minimal and training protocols constant. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (12.5% gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.
Primary: Presight AI
All Institutions: Presight AI
The main contribution of this paper is the identification and quantification of cross-cultural bias in mel-scale audio representations, alongside the introduction of alternative representations that significantly reduce performance disparities. This work is a critical step towards developing fairer audio systems, highlighting the importance of cultural considerations in machine learning applications.
The paper employs a robust methodology that systematically evaluates the impact of mel-scale representations on audio processing across diverse cultural contexts. The authors isolate the contributions of various front-end configurations while maintaining consistent architecture and training protocols. They introduce a comprehensive set of fairness metrics to quantify performance disparities, which is a significant advancement in the evaluation of audio systems. The theoretical foundation is well-articulated, linking frequency resolution to classification error, thereby providing a strong basis for their claims.
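Since the front-end is the controlled variable, the comparison largely reduces to swapping the feature extractor. The snippet below contrasts a mel front-end with a CQT front-end using `librosa`; the example clip and parameter values are placeholders, not the paper's configuration.

```python
import numpy as np
import librosa

# Load a bundled example clip; any mono recording would do.
y, sr = librosa.load(librosa.ex("trumpet"))

# Mel front-end: filter spacing fixed by mid-20th-century psychoacoustic data.
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))

# CQT front-end: geometrically spaced bins with constant Q per octave, the
# kind of representation the study finds narrows gaps on non-Western music.
cqt = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12)))

print(mel.shape, cqt.shape)  # (128, T) and (84, T)
```

Because everything downstream is held constant, any performance difference between such representations can be attributed to how they allocate frequency resolution.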
The experiments are well-designed, utilizing a diverse set of datasets across speech recognition, music analysis, and acoustic scene classification. The balanced sampling across languages and musical traditions ensures that the results are meaningful and generalizable. The statistical significance of the findings is rigorously tested, enhancing the credibility of the results. The performance gaps highlighted in the results section are compelling and underscore the need for alternative representations.
The authors provide sufficient details regarding their experimental setup, including hyperparameters and dataset specifications, which facilitates reproducibility. The release of FairAudioBench as a benchmark for cross-cultural evaluation further enhances the reproducibility of their findings and allows other researchers to validate and build upon their work.
While the study is comprehensive, it acknowledges limitations in geographic coverage, particularly the underrepresentation of African tonal languages and indigenous musical traditions. Additionally, the focus on single-axis biases without addressing intersectionality may overlook complex interactions between different forms of bias. Future work could expand on these aspects to provide a more nuanced understanding of audio processing disparities.
This research has significant implications for the development of inclusive audio systems that are equitable across cultural contexts. By challenging the assumptions underlying traditional psychoacoustic models, the authors advocate for a paradigm shift in audio processing that considers cultural diversity. The findings can inform the design of more effective speech recognition systems and music analysis tools that serve a global audience, ultimately contributing to a more equitable technological landscape.
Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. To address this gap, we propose VidAudio-Bench, a multi-task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories - sound effects, music, speech, and singing - under both V2A and Video-Text-to-Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video-text pairs and benchmarks 11 state-of-the-art generation models. (3) Comprehensive Metrics: It introduces 13 task-specific, reference-free metrics to systematically assess audio quality, video-audio consistency, and text-audio consistency. (4) Human Alignment: It validates all metrics through subjective studies, demonstrating strong consistency with human preferences. Experimental results reveal that current V2A models perform poorly in speech and singing compared to sound effects. Our VT2A results further highlight a fundamental tension between instruction following and visually grounded generation: stronger visual conditioning improves video-audio alignment, but often at the cost of generating the intended audio category. These findings establish VidAudio-Bench as a comprehensive and scalable framework for diagnosing V2A systems and provide new insights into multimodal audio generation.
Primary: Shanghai Jiaotong University
All Institutions: Shanghai Jiaotong University
The main contribution of this paper is the establishment of VidAudio-Bench, a comprehensive benchmark for evaluating V2A and VT2A systems, which systematically addresses the limitations of existing evaluation methodologies and provides valuable insights into the performance of current models. The technical contributions, including the innovative evaluation metrics and the extensive dataset, position this work as a significant advancement in the field of audio generation.
The paper introduces VidAudio-Bench, a novel benchmarking framework for Video-to-Audio (V2A) and Video-Text-to-Audio (VT2A) generation that addresses the limitations of existing evaluation methods by providing a multi-task benchmark with task-specific metrics. The methodology is robust, featuring a comprehensive dataset of 1,634 video-text pairs across four audio categories, and it employs both objective and subjective evaluation metrics, including human alignment studies to validate the proposed metrics. The introduction of a zero-information-leak design for VT2A evaluation is particularly innovative, allowing for a clearer assessment of visual understanding without relying on textual shortcuts.
The experimental evaluation is thorough, benchmarking 11 state-of-the-art models across various tasks and dimensions. The results reveal significant insights into the performance of current V2A models, particularly their struggles with speech and singing tasks. The paper effectively uses a variety of metrics to assess audio quality, video-audio consistency, and text-audio consistency, providing a comprehensive view of model performance. The correlation analysis with human evaluations further strengthens the credibility of the findings.
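A rank correlation against mean human ratings is the usual way to validate such metrics; the paper's exact statistic is not specified here, so the sketch below uses Spearman's rho on made-up per-clip scores purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-clip scores: one automatic metric vs. mean human rating.
metric_scores = np.array([0.62, 0.71, 0.55, 0.80, 0.43, 0.67])
human_ratings = np.array([3.1, 3.8, 2.9, 4.2, 2.2, 3.5])

rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")  # rank agreement with humans
```

A metric that ranks clips in the same order as human raters is usable as a reference-free proxy even if its absolute scale is arbitrary.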
The paper provides detailed descriptions of the dataset construction, evaluation metrics, and experimental setup, which are essential for reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the results directly. The methodology is well-documented, but the lack of a project URL or demo limits broader accessibility.
One limitation is the reliance on subjective human evaluations, which, while valuable, can introduce variability and bias. Additionally, the dataset may not cover all possible scenarios in V2A generation, potentially limiting the generalizability of the findings. The paper also notes a fundamental tension between instruction following and visually grounded generation, indicating that there are inherent challenges in achieving optimal performance across all tasks.
The development of VidAudio-Bench has significant implications for the field of multimodal audio generation, providing a structured framework that can guide future research and model development. By highlighting the challenges faced by current V2A models, the paper encourages further exploration into improving audio generation systems, which can enhance applications in entertainment, accessibility, and human-computer interaction.
MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) networks: a private listener-level CfC that predicts a short-horizon affective trajectory on Russell's circumplex and drives proactive curation, and a shared mesh-runtime CfC at MMP Layer 6 that integrates Cognitive Memory Blocks (CMBs) from co-listening peers. CfC hidden states never cross the wire; only structured CMBs do. A Personal Arousal Function (PAF) replaces the standard linear mapping from audio intensity to psychological arousal with a per-listener learned adjustment, trained from behavioral signals (skip, completion, favorite, volume) and from drift between user-declared mood and machine inference. The same track receives different arousal predictions for different listeners. The model (94,552 parameters) achieves trajectory MAE 0.414, pattern accuracy 96.6%, and intent accuracy 69.4% on held-out validation. PAF evidence from a live deployment session (46 observations across 11 genres) demonstrates that the learning loop operates end-to-end, with pop reaching full confidence after 22 observations. All inference runs on-device via CoreML. To our knowledge, this is the first production deployment of MMP/SVAF on consumer mobile hardware. The accompanying SDK (sym-swift v0.3.78, SYMCore v0.3.7) enforces strict protocol conformance. Music is the case study; the substrate is the contribution.
Primary: SYM.BOT
All Institutions: SYM.BOT
The main contribution of MeloTune is its innovative architecture that combines continuous-time modeling with peer-to-peer mood coupling for personalized music curation. This approach addresses key limitations in traditional music recommendation systems, offering a promising direction for future research and applications in affect-aware technologies.
The methodology presented in MeloTune is innovative, leveraging a dual-layer architecture that combines a private listener-level Closed-form Continuous-time (CfC) network with a shared mesh-runtime CfC for peer-to-peer mood coupling. The Personal Arousal Function (PAF) is a significant advancement, allowing for personalized arousal predictions based on behavioral signals, which is a notable departure from traditional methods that rely on audio intensity alone. The use of Cognitive Memory Blocks (CMBs) for structured communication between agents is a unique aspect that enhances the system's ability to maintain privacy while still enabling collaborative mood curation. The continuous-time modeling approach is well-justified and effectively addresses the limitations of existing sequential recommendation systems.
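A very rough sketch of the PAF idea, under heavy assumptions: a shared linear base maps audio intensity to arousal, and a small listener-specific network learns a correction from the four behavioral signals named in the abstract. Everything beyond that (shapes, layer sizes, how the declared-vs-inferred mood drift enters) is guesswork, not the deployed model.

```python
import torch
import torch.nn as nn

class PersonalArousalFunction(nn.Module):
    """Per-listener arousal mapping: population-level linear base plus a
    learned listener-specific adjustment. The 4-dim behavior vector stands
    for (skip, completion, favorite, volume); all sizes are illustrative."""
    def __init__(self, n_behavior=4, hidden=8):
        super().__init__()
        self.base = nn.Linear(1, 1)            # shared intensity -> arousal map
        self.adjust = nn.Sequential(           # per-listener correction
            nn.Linear(1 + n_behavior, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, intensity, behavior):
        # intensity: (B, 1) audio intensity; behavior: (B, 4) listener signals
        return self.base(intensity) + self.adjust(
            torch.cat([intensity, behavior], dim=-1))
```

The point of the structure is that the same track (same intensity input) can receive different arousal predictions for different listeners, as the abstract claims.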
The paper provides quantitative results from a live deployment, including metrics such as trajectory Mean Absolute Error (MAE), pattern accuracy, and intent accuracy. While the results are promising, the absence of a comprehensive controlled evaluation and comparisons against established benchmarks limits the robustness of the findings. The reported metrics indicate that the system performs well in predicting affective trajectories, but further validation against more diverse datasets and user scenarios would strengthen the claims.
The implementation details are described in sufficient depth, particularly regarding the architecture and training procedures. However, the lack of publicly available code or a demo limits reproducibility. The paper mentions an SDK, but without access to the actual implementation, independent verification of the results is challenging.
The primary limitations include the reliance on user-declared moods, which may not always be available or accurate, potentially affecting the PAF's learning process. Additionally, the system's performance in diverse real-world scenarios and with different user demographics is not fully explored. The absence of a controlled evaluation against standard recommendation systems raises questions about the generalizability of the results.
MeloTune has the potential to significantly impact the music recommendation landscape by providing a more personalized and context-aware listening experience. The approach could be extended to other domains where user affect and social context play a crucial role, such as in mental health applications or collaborative environments. The focus on privacy-preserving techniques is particularly relevant in today's data-sensitive climate.
Audio-native large language models (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensive training on large-scale non-speech data. We present Whisper-AuT, a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on a curated mixture of speech (80%), environmental sound (10%), and music (10%) totaling approximately 20M samples. The full encoder-decoder is trained end-to-end with a seq2seq captioning objective; the decoder is then discarded and only the encoder is retained. Linear probe evaluations show that Whisper-AuT achieves +23.0% on ESC-50 (environmental sound), +5.0% on GTZAN (music genre), and +0.7% on Speech Commands (keyword spotting) compared to the original Whisper-large-v3 encoder. Whisper-AuT is designed as a drop-in replacement for Whisper in audio-LLM architectures, with the goal of reducing downstream training cost by providing stronger initial audio representations for non-speech domains.
Primary: Salesforce AI Research
All Institutions: Salesforce AI Research
The main contribution of this paper is the introduction of Whisper-AuT, a domain-adapted audio encoder that improves the representation of non-speech audio, thereby reducing the training costs and enhancing the performance of downstream audio-LLMs. This work represents a meaningful advancement in the field of audio processing and machine learning, particularly in the context of integrating audio understanding with large language models.
The methodology is clearly articulated, following a systematic approach to fine-tune the Whisper-large-v3 model on a curated dataset that includes a balanced mix of speech, environmental sounds, and music. The use of a seq2seq training paradigm is consistent with existing practices, but the adaptation to a mixed-domain dataset is a notable improvement. The decision to retain only the encoder after training is a practical choice that simplifies integration into existing audio-LLM frameworks. However, the paper could benefit from more detailed descriptions of the training process and hyperparameter choices.
The experimental evaluation is robust, utilizing linear probing on well-established benchmarks (ESC-50, GTZAN, Speech Commands) to assess the encoder's performance across different audio domains. The reported improvements (+23.0% on ESC-50, +5.0% on GTZAN, +0.7% on Speech Commands) are significant and demonstrate the effectiveness of the proposed approach. However, the evaluation could be strengthened by including additional metrics or qualitative assessments to provide a more comprehensive view of the encoder's capabilities.
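For reference, the standard linear-probe protocol used in such evaluations looks roughly like the following: the encoder stays frozen, features are mean-pooled over time, and a single linear classifier is fit on top. The 1280-dimensional features match Whisper-large-v3's encoder width; the data here is random and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    # Feature extraction (frozen encoder + temporal mean pooling) is assumed
    # to happen upstream; only a linear classifier is trained here.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return accuracy_score(test_labels, clf.predict(test_feats))

# Toy shapes: pooled 1280-dim encoder features, 50 classes as in ESC-50.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 1280)), rng.integers(0, 50, 200)
X_te, y_te = rng.normal(size=(50, 1280)), rng.integers(0, 50, 50)
print(linear_probe(X_tr, y_tr, X_te, y_te))
```

Because the classifier is linear, any accuracy gain over the original encoder must come from the quality of the frozen representations themselves.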
The paper provides a reasonable level of detail regarding the training configuration and data preparation, which aids in reproducibility. However, the lack of specific hyperparameter settings and the absence of a publicly available code repository hinder full reproducibility. Including these details would greatly enhance the paper's impact.
One limitation is the reliance on a relatively small dataset (20M samples) for fine-tuning, which may not fully capture the diversity of non-speech audio. Additionally, while the improvements on environmental sound and music are notable, the marginal gain on speech suggests that the encoder may not significantly enhance performance in that domain. Future work should explore the effects of varying the dataset composition and size.
The development of Whisper-AuT has the potential to significantly reduce the computational burden associated with training audio-LLMs, making them more accessible for various applications in audio understanding and generation. By providing stronger initial representations for non-speech audio, this work could enhance the performance of audio-LLMs in real-world applications, such as content creation, sound classification, and interactive audio systems.
A psychological profile that structurally documents a depression patient's case is essential for psychotherapy. Large language models can be applied to summarize such profiles from counseling speech, but they may suffer from long-context forgetting and produce unverifiable hallucinations, owing to the length of the speech, multi-party interactions, and unstructured conversation. To address this, we propose StreamProfile, a streaming framework that processes counseling speech incrementally, extracts evidence grounded in ASR transcriptions and stores it in a Hierarchical Evidence Memory, and then runs a Chain-of-Thought pipeline following the PM+ psychological intervention protocol for clinical reasoning. The final profile is synthesized strictly from the stored evidence, making every claim traceable. Experiments on real-world teenager counseling speech show that the proposed StreamProfile system generates profiles accurately and prevents hallucination.
Primary: South China University of Technology
All Institutions: South China University of Technology, Chinese Academy of Sciences, Key Laboratory of Biomedical Imaging Science and System, Shenzhen Institutes of Advanced Technology, Shenzhen Mental Health Center
The main contribution of this paper is the introduction of StreamProfile, a novel framework that integrates streaming processing, CoT reasoning, and evidence memory to generate accurate and verifiable psychological profiles from counseling sessions. This work represents a significant advancement in the application of LLMs in mental health, addressing critical challenges and demonstrating substantial improvements over existing methods.
The methodology presented in this paper is innovative, combining a streaming framework with a Chain-of-Thought (CoT) reasoning process and a Hierarchical Evidence Memory (HEM) to generate psychological profiles from counseling sessions. The approach addresses critical issues such as long-context forgetting and hallucinations in LLMs by ensuring that every claim made in the generated profiles is traceable to specific utterances from the counseling session. The use of a structured protocol (PM+) to guide the reasoning process is particularly noteworthy, as it aligns the model's outputs with clinical standards.
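A minimal sketch of what an evidence store with traceable claims could look like is shown below. The flat section dictionary is a simplification of the paper's hierarchical memory, and the section names are illustrative rather than drawn from the PM+ protocol.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """One grounded unit: an ASR span plus where it came from."""
    utterance_id: int
    speaker: str
    text: str

@dataclass
class EvidenceMemory:
    """Evidence store keyed by profile section (hypothetical section names)."""
    sections: dict = field(default_factory=dict)

    def add(self, section: str, ev: Evidence):
        self.sections.setdefault(section, []).append(ev)

    def cite(self, section: str):
        # Profile claims are synthesized only from stored spans, so every
        # claim can be traced back to a specific utterance id.
        return [(e.utterance_id, e.text) for e in self.sections.get(section, [])]

mem = EvidenceMemory()
mem.add("sleep", Evidence(12, "client", "I barely sleep three hours a night."))
print(mem.cite("sleep"))
```

Constraining generation to cite only stored spans is what makes hallucinated claims detectable: any statement without a matching utterance id can be rejected.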
The experiments conducted on the Psy-Bench dataset demonstrate a rigorous evaluation of the proposed system against various LLM baselines. The results indicate significant improvements in both profile generation performance and hallucination reduction. The use of multiple evaluation metrics, including ROUGE-L, BERTScore, and subjective assessments of hallucination and consistency, provides a comprehensive understanding of the system's capabilities. The ablation studies further validate the effectiveness of the CoT and HEM components.
The paper provides detailed descriptions of the experimental setup, including the LLMs used, evaluation metrics, and dataset characteristics. However, the lack of a publicly available codebase or demo limits the reproducibility of the results. The authors mention using specific models and configurations, but without access to the implementation, it may be challenging for other researchers to replicate the findings.
One limitation is the reliance on a specific dataset (Psy-Bench) that may not generalize to other contexts or languages, as the experiments are conducted on a Chinese dataset. Additionally, while the framework addresses hallucinations effectively, the potential for misinterpretation of nuanced clinical language remains a concern. The paper also does not discuss the computational resources required for real-time processing, which could impact practical deployment.
The proposed framework has significant implications for mental health care, particularly in enhancing the efficiency and accuracy of psychological assessments. By automating the generation of structured profiles from counseling sessions, it could aid clinicians in delivering timely and informed interventions. However, ethical considerations regarding patient data privacy and the potential for over-reliance on AI in sensitive clinical contexts must be addressed.
Automatic depression detection using speech signals with acoustic and textual modalities is a promising approach for early diagnosis. Depression-related patterns exhibit sparsity in speech: diagnostically relevant features occur in specific segments rather than being uniformly distributed. However, most existing methods treat all frames equally, assuming depression-related information is uniformly distributed, and thus overlook this sparsity. To address this issue, we propose a depression detection network based on Adaptive Cross-Modal Gating (ACMG) that adaptively reassigns frame-level weights across both modalities, enabling selective attention to depression-related segments. Experimental results show that the depression detection system with ACMG outperforms baselines without it. Visualization analyses further confirm that ACMG automatically attends to clinically meaningful patterns, including low-energy acoustic segments and textual segments containing negative sentiments.
Primary: Shenzhen Institutes of Advanced Technology
All Institutions: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Key Laboratory of Biomedical Imaging Science and System
The main contribution of this paper is the introduction of the Adaptive Cross-Modal Gating (ACMG) mechanism for depression detection, which effectively enhances the identification of clinically relevant features in speech and text. The comprehensive analysis of the technical contribution, methodology, and significance to the field demonstrates the potential of this approach to improve automatic depression detection systems.
The proposed Adaptive Cross-Modal Gating (ACMG) mechanism is innovative in its approach to addressing the sparsity of depression-related patterns in speech and text. The dual-branch architecture effectively combines acoustic and textual modalities, leveraging pre-trained models and an adaptive gating mechanism to enhance the detection of clinically relevant features. The methodology is well-structured, with clear explanations of the ACMG mechanism, global context extraction, and feature refinement processes, demonstrating a comprehensive understanding of the problem domain.
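The gating idea can be sketched as follows: per-frame weights for one modality are predicted from a global summary of the other modality, so the network can concentrate on sparse, diagnostically relevant segments. The layer sizes and sigmoid gate below are assumptions, not the paper's exact ACMG module.

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Predict frame weights for one modality from the other's global context
    (e.g. mean-pooled text features gating acoustic frames)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, frames, other_context):
        # frames: (B, T, D); other_context: (B, D) summary of the other modality
        ctx = other_context.unsqueeze(1).expand(-1, frames.size(1), -1)
        w = self.gate(torch.cat([frames, ctx], dim=-1))  # per-frame weights
        return frames * w  # down-weight uninformative frames

gate = CrossModalGate(dim=256)
out = gate(torch.randn(2, 300, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 300, 256])
```

Because the gate is learned end-to-end, the visualization analyses described above amount to inspecting where the predicted weights concentrate.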
The experiments are conducted on two relevant datasets, PDCD2025 and DAIC-WOZ, which are suitable for evaluating the effectiveness of the proposed method. The results indicate a significant improvement over baseline models, with the ACMG mechanism consistently outperforming non-ACMG systems. The use of quantitative metrics, such as accuracy and F1 score, alongside qualitative analyses, strengthens the evaluation. However, the paper lacks detailed ablation studies that could further clarify the contributions of individual components.
The paper provides a clear description of the methods and datasets used, but it lacks specific implementation details, such as hyperparameters and training procedures, which are crucial for reproducibility. The absence of a publicly available code repository or demo limits the ability of other researchers to replicate the results.
One limitation is the reliance on pre-trained models, which may not fully capture the nuances of depression-related speech and text. Additionally, the paper does not address potential biases in the datasets used, which could affect the generalizability of the findings. The lack of a comprehensive comparison with state-of-the-art methods in the field also limits the contextual understanding of the proposed approach's performance.
The implications of this research are significant, as automatic depression detection can lead to earlier diagnosis and intervention, improving mental health outcomes. The methodology could be adapted for other mental health conditions, expanding its applicability. However, ethical considerations regarding data privacy and the potential for misdiagnosis must be addressed in future work.
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
Primary: Texas A&M University
All Institutions: Texas A&M University
The main contribution of this paper is the introduction of KeyMyna, a novel approach to music key detection that leverages masked contrastive pretraining to achieve state-of-the-art performance without complex data augmentation. This work significantly advances the field of music information retrieval by demonstrating the potential of self-supervised learning techniques in capturing pitch-sensitive representations, thereby addressing a critical challenge in the domain.
The paper introduces KeyMyna, a systematic study of self-supervised pretraining for music key detection using masked contrastive learning. The methodology is well-structured, leveraging a pre-trained model (Myna-Vertical) and shallow multi-layer perceptrons (MLPs) for key detection. The authors effectively demonstrate the advantages of their approach over traditional methods and other deep learning models, particularly in terms of pitch sensitivity and robustness to augmentations. The use of a simple contrastive learning framework with token masking is innovative and addresses the challenges of limited labeled datasets in music key detection.
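The probing setup can be sketched as a single wide hidden layer over frozen self-supervised features. The width, dropout, feature dimension, and 24-way key output (12 tonics x major/minor) are assumptions for illustration, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class KeyProbe(nn.Module):
    """Shallow but wide MLP over frozen embeddings for key classification."""
    def __init__(self, feat_dim=768, width=4096, n_keys=24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, width), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(width, n_keys))

    def forward(self, feats):
        # feats: pooled embeddings from the frozen pre-trained encoder
        return self.net(feats)

probe = KeyProbe()
logits = probe(torch.randn(8, 768))  # batch of 8 clip-level embeddings
print(logits.shape)                  # torch.Size([8, 24])
```

Since only the probe is trained, strong key-detection accuracy is direct evidence that the pretraining itself produced pitch-sensitive representations.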
The experiments are thorough, utilizing two widely recognized datasets (GiantSteps and McGill Billboard) for evaluation. The results show that KeyMyna outperforms existing methods despite using less data and simpler architectures. The paper provides a comprehensive comparison with prior work, demonstrating the effectiveness of their approach through various metrics. However, the paper could benefit from more extensive ablation studies to further validate the impact of individual components of their methodology.
The authors provide a GitHub repository with code and models, which is a positive aspect for reproducibility. Detailed hyperparameter settings and training configurations are presented, but the absence of a complete training script or environment setup instructions may hinder full reproducibility for some researchers.
The paper acknowledges limitations, such as the inability of KeyMyna to track key modulations within songs, which could affect performance in certain musical genres. Additionally, the focus on major and minor keys limits the model's applicability to more complex musical structures. Future work is suggested to address these limitations, including the exploration of moving averages for key modulation detection.
The findings of this research have significant implications for music information retrieval (MIR) applications, including playlist generation and music similarity search. By improving key detection through self-supervised learning, the work contributes to the development of more robust and efficient MIR systems. The insights gained from this study could also inform future research in music analysis and representation learning.
Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-specific encoders with hand-crafted domain features, as a bridge across this gap. In a three-phase campaign covering 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor-free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave-band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics. We also introduce reverse cross-attention, where descriptor tokens query encoder features, reducing attention operations relative to the standard formulation while remaining competitive. CKA analysis shows that descriptors substantially increase audio-MIDI transformer layer alignment, indicating representational convergence rather than simple feature concatenation. Perturbation analysis identifies high-frequency octave bands as the dominant discriminative signal. All experiments use MAESTRO v3.0.0 with an evaluation protocol controlling for composer and piece similarity.
Primary: Asociación Civil AlterMundi
All Institutions: Asociación Civil AlterMundi
The main contribution of this paper is the introduction of descriptor injection as a novel approach to improve audio-MIDI alignment, demonstrating that simple, hand-crafted features can significantly enhance cross-modal learning performance. The comprehensive methodology and rigorous experimental validation position this work as a meaningful advancement in the field of machine learning for music.
The paper presents a systematic exploration of descriptor injection for cross-modal audio-MIDI learning, employing a robust methodology that includes a three-phase experimental design with various descriptor-mechanism combinations and architectural families. The introduction of reverse cross-attention as a novel mechanism to reduce attention operations while maintaining competitive performance is a significant methodological contribution. The use of causal ablation and CKA analysis to validate the effectiveness of the descriptors adds rigor to the methodology.
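A minimal sketch of the reverse formulation: a few descriptor tokens act as the queries and the encoder's frame features as the keys and values, so only the descriptor tokens are updated and passed onward. Dimensions and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReverseCrossAttention(nn.Module):
    """Descriptor tokens query the encoder's frame features."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, desc_tokens, frame_feats):
        # desc_tokens: (B, n_desc, D); frame_feats: (B, T, D)
        out, _ = self.attn(desc_tokens, frame_feats, frame_feats)
        return out  # (B, n_desc, D): descriptor-conditioned summary

rca = ReverseCrossAttention(dim=256)
summary = rca(torch.randn(2, 4, 256), torch.randn(2, 500, 256))
print(summary.shape)  # torch.Size([2, 4, 256])
```

With only a handful of descriptor tokens as queries, all downstream computation operates on n_desc tokens rather than T frames, which is where the claimed reduction relative to the standard formulation comes from.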
The experiments are comprehensive, utilizing the MAESTRO v3.0.0 dataset and employing a structured evaluation protocol. The results demonstrate clear improvements over the baseline, with statistical significance established through multi-seed validation. The paper effectively communicates the experimental results, including detailed ablation studies and sensitivity analyses, which substantiate the claims made regarding the effectiveness of the proposed methods.
The paper provides sufficient details on the architecture, training protocols, and evaluation metrics, which supports reproducibility. However, the lack of a public repository or demo URL limits the ease of access for other researchers to replicate the findings.
The study is constrained by its use of a single dataset (MAESTRO v3.0.0), which may limit the generalizability of the findings to other musical genres or instruments. Additionally, the paper acknowledges the absence of data augmentation techniques, which could enhance robustness. The D4 descriptor's weak inference-time effect raises questions about its practical utility despite its training benefits.
The findings have potential implications for music information retrieval, automatic transcription, and musicological analysis, as they suggest that structured domain knowledge can significantly enhance cross-modal learning. The approach could be extended to other domains where modality gaps exist, making it relevant beyond music.