Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end-to-end models that exhibit unacceptable response delays. Moreover, the absence of real-world benchmarks and holistic metrics hinders progress in the field. This paper presents a comprehensive framework to overcome these limitations. We first introduce SID-Bench, the first benchmark for semantic-aware interruption detection built entirely from real-world human dialogues. To provide a rigorous assessment of the responsiveness-robustness trade-off, we propose the Average Penalty Time (APT) metric, which assigns a temporal cost to both false alarms and late responses. Building on this framework, we design an LLM-based detection model optimized through a novel training paradigm to capture subtle semantic cues of intent. Experimental results show that our model significantly outperforms mainstream baselines, achieving a nearly threefold reduction in APT. By successfully resolving the long-standing tension between speed and stability, our work establishes a new state-of-the-art for intelligent interruption handling in SDS. To facilitate future research, SID-Bench and the associated code are available at: https://github.com/xkx-hub/SID-bench.
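The abstract describes APT as assigning a temporal cost to both false alarms and late responses, without giving the exact definition. The following is a minimal sketch of such a metric, assuming a fixed illustrative penalty for false alarms and misses; the constants and function signature are hypothetical, not the authors' formulation.

```python
# Hypothetical sketch of an Average Penalty Time (APT)-style metric.
# FALSE_ALARM_COST and MISS_COST are assumed illustrative constants,
# not the values defined in the paper.

FALSE_ALARM_COST = 5.0   # assumed fixed penalty (s) per spurious trigger
MISS_COST = 5.0          # assumed cap for interruptions never detected

def average_penalty_time(false_alarms, detection_latencies, misses):
    """false_alarms: number of spurious triggers;
    detection_latencies: seconds from interruption onset to response;
    misses: number of true interruptions never detected.
    Every event contributes a temporal cost; the mean is the APT."""
    costs = ([FALSE_ALARM_COST] * false_alarms
             + list(detection_latencies)
             + [MISS_COST] * misses)
    return sum(costs) / len(costs) if costs else 0.0
```

Under this sketch, a system that fires instantly but often falsely and one that never fires falsely but responds late both accrue penalty time, which is the trade-off the metric is meant to capture.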
Primary: Qwen Team, Alibaba
All Institutions: Qwen Team, Alibaba, Independent Researcher
The main contribution of this paper is the introduction of SID-Bench and the APT metric, which together provide a comprehensive framework for evaluating and improving interruption detection in spoken dialogue systems. This work significantly enhances the understanding of user interruptions in conversational AI, offering a robust methodology and strong experimental validation that addresses key challenges in the field.
The paper introduces a novel framework for interruption detection in spoken dialogue systems, emphasizing the creation of SID-Bench, a benchmark based on real-world data, and the Average Penalty Time (APT) metric for evaluation. The methodology is robust, incorporating a two-stage training paradigm that effectively leverages large language models (LLMs) for semantic understanding, and a hybrid annotation approach that combines LLMs with forced alignment for precise interruption labeling. This innovative approach addresses the limitations of existing VAD-based systems and enhances the model's ability to discern genuine interruptions from backchannels.
The experimental results are comprehensive, demonstrating a significant performance improvement in the proposed model over existing baselines across various metrics, including APT, FIR, and IRL. The use of SID-Bench allows for a rigorous evaluation of the model's capabilities in real-world scenarios, and the results clearly illustrate the trade-off between responsiveness and robustness, validating the effectiveness of the proposed methods.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which would allow for reproducibility. The availability of SID-Bench and the associated code on GitHub further enhances the potential for other researchers to replicate the study and build upon the findings.
One limitation is the reliance on a specific set of conversational data, which may not encompass all possible interaction scenarios. Additionally, while the model achieves significant improvements in APT, further exploration into its performance across diverse languages and dialects could be beneficial. The paper also does not address the computational resources required for training the LLM-based model, which may limit accessibility for some researchers.
The proposed framework and benchmark have the potential to significantly advance the field of spoken dialogue systems by providing a more nuanced understanding of interruption handling. This could lead to more natural and efficient human-computer interactions, with applications in customer service, virtual assistants, and other conversational AI systems. The introduction of SID-Bench sets a precedent for future research in this area, encouraging the development of more sophisticated models that can better understand human intent.
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
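The abstract presents the Global-Sentence-Token schema as a top-down hierarchy that doubles as a structured interface between an LLM agent and the synthesis engine. The snippet below illustrates what such a layered record might look like and how it could be flattened into per-level controls; all field names are invented for illustration and do not come from the paper.

```python
# Hypothetical illustration of a Global-Sentence-Token style record:
# scene semantics at the top, per-sentence style in the middle, and
# token-level detail at the bottom. Field names are invented.

annotation = {
    "global": {"scene": "two-person interview", "environment": "studio",
               "emotional_arc": "neutral -> excited"},
    "sentences": [
        {"speaker": "A", "emotion": "curious", "style": "fast",
         "tokens": [{"text": "really", "emphasis": True}]},
    ],
}

def flatten_instructions(ann):
    """Collapse the hierarchy into flat (level, key, value) controls,
    mimicking how a layered protocol might feed a synthesis engine."""
    out = [("global", k, v) for k, v in ann["global"].items()]
    for i, s in enumerate(ann["sentences"]):
        out += [(f"sentence[{i}]", k, v)
                for k, v in s.items() if k != "tokens"]
        for j, t in enumerate(s["tokens"]):
            out += [(f"token[{i}][{j}]", k, v) for k, v in t.items()]
    return out
```

The point of the sketch is the layering itself: a front-end LLM can emit the same structured record regardless of input modality, making text an information-complete control channel.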
Primary: Nanjing University
All Institutions: Nanjing University, WeNet Open Source Community
The main contribution of this paper is the introduction of the Borderless Long Speech Synthesis framework, which innovatively integrates multi-dimensional annotations and contextual understanding into TTS systems, significantly advancing the state-of-the-art in audio synthesis. The technical contributions and proposed methodologies offer substantial improvements over existing systems, although further experimental validation and reproducibility efforts are necessary to solidify its impact in the field.
The proposed methodology introduces a novel framework for long-form speech synthesis that emphasizes the importance of global context and paralinguistic cues. The "Labeling over filtering/cleaning" strategy is innovative, as it challenges conventional practices in data preparation by advocating for the inclusion of complex, noisy data that reflects real-world speech dynamics. The Global-Sentence-Token hierarchical annotation schema is a significant advancement, enabling a structured approach to capturing the nuances of speech synthesis. The integration of Chain-of-Thought reasoning and Dimension Dropout enhances the model's ability to follow complex instructions, which is a notable methodological improvement over existing TTS systems.
The paper lacks quantitative evaluations of the proposed system's performance, particularly in terms of emotional arc coherence and multi-speaker interaction naturalness. While it discusses the challenges of evaluating borderless long audio synthesis, it does not provide concrete experimental results or comparisons with existing methods. The absence of benchmark results limits the ability to assess the system's effectiveness rigorously. Future work is needed to establish robust evaluation metrics that can capture the richness of the proposed framework.
The paper does not provide sufficient implementation details or access to code and datasets, which raises concerns about reproducibility. The lack of a demo or project URL further complicates the ability for other researchers to replicate the findings or build upon this work. Clearer documentation and shared resources would enhance reproducibility.
The system is currently optimized for content creation rather than real-time interactions, which limits its applicability in dynamic environments. Additionally, the training data is primarily speech-centric, and the system's emergent capabilities for sound effects and music are not fully developed. These limitations suggest that while the framework is promising, it requires further refinement and expansion to address broader applications.
The potential applications of this research extend beyond traditional TTS systems, offering possibilities for enhanced audio experiences in content creation, gaming, and virtual environments. The ability to synthesize speech with rich emotional and contextual cues could significantly improve user engagement and interaction quality in various multimedia applications. However, the challenges in real-time synthesis and the need for more diverse training data must be addressed to realize its full impact.
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.
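The abstract's layer-wise, depth-aware partial aggregation can be pictured as follows: a client that exits early returns updates only for the layers it reached, and the server averages each layer over just the clients that trained it. This is a minimal sketch under that assumption; the data layout and function name are illustrative, not the paper's implementation.

```python
# Minimal sketch of layer-wise, depth-aware partial aggregation.
# Assumption: each client returns updates only for the layers below
# its early exit, as a dict {layer_index: weight_vector}.

def depth_aware_aggregate(client_updates):
    """Average each layer over the clients that actually trained it,
    so shallow layers (seen by everyone) and deep layers (seen only by
    high-capacity clients) are aggregated with different denominators."""
    aggregated = {}
    for layer in {l for upd in client_updates for l in upd}:
        contribs = [upd[layer] for upd in client_updates if layer in upd]
        # element-wise mean over contributing clients only
        aggregated[layer] = [sum(vals) / len(contribs)
                             for vals in zip(*contribs)]
    return aggregated
```

Compared with plain FedAvg, this avoids diluting deep-layer updates with zeros from clients that never computed them.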
Primary: University of Cambridge
All Institutions: University of Cambridge, Electronic Information School, Flower Labs, University of Auckland, University of Melbourne, Wuhan University
This paper presents a novel adaptive federated fine-tuning framework that effectively addresses the challenges of heterogeneous environments in self-supervised speech representation learning. The technical contributions, particularly in the areas of early exits and layer-wise aggregation, represent a meaningful advancement in the field of federated learning for audio applications.
The proposed adaptive federated fine-tuning framework introduces innovative mechanisms such as early exits and layer-wise partial aggregation, which effectively address the challenges posed by heterogeneity in federated learning environments. The methodology is well-structured, leveraging an elastic multi-branch architecture that allows clients to dynamically select their training depth based on local resources and task complexity. This approach not only enhances computational efficiency but also maintains performance across diverse speech tasks. The integration of lightweight prediction heads and depth-aware aggregation strategies is a significant advancement in federated learning for speech applications.
The experiments are comprehensive, covering five diverse downstream tasks that span various aspects of speech understanding. The results demonstrate the effectiveness of the proposed framework in reducing computational overhead while achieving competitive performance compared to centralized training. The evaluation metrics used, including word error rates and classification error rates, are appropriate for the tasks at hand. However, the paper could benefit from additional comparisons with existing state-of-the-art methods to further contextualize the results.
The paper provides a detailed description of the experimental setup, including datasets, model architectures, and training configurations, which aids reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the experiments. Including a link to the implementation would significantly enhance reproducibility.
One limitation is the reliance on a specific backbone model (Wav2Vec 2.0), which may not generalize to all speech tasks or architectures. Additionally, while the framework addresses resource constraints, it does not fully explore the implications of data heterogeneity beyond the basic partitioning strategy employed. The paper could also discuss potential trade-offs between performance and computational efficiency in more detail.
The proposed framework has significant implications for deploying speech recognition systems in privacy-sensitive environments, such as mobile devices and personal assistants. By enabling efficient fine-tuning without compromising user data privacy, this work contributes to the growing field of privacy-preserving machine learning. The methodology could be adapted to other domains where federated learning is applicable, potentially influencing future research in decentralized learning systems.
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives-including speech, music, and acoustic properties-which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.
Primary: Xiaomi Research
All Institutions: Xiaomi Research
The main contribution of this paper is the introduction of ACAVCaps, a novel large-scale audio captioning dataset that significantly enhances the granularity and diversity of audio understanding, thereby advancing the development of robust audio-language models. The methodology and experimental validation presented in this work position it as a valuable resource for future research in the field of audio processing and multimodal learning.
The methodology for constructing the ACAVCaps dataset is innovative, utilizing a multi-expert pipeline that integrates various analytical perspectives (speech, music, acoustic properties) and synthesizes detailed descriptions using a large language model (LLM). This approach addresses the limitations of existing datasets by ensuring both scale and descriptive granularity, which are crucial for training versatile audio models. The use of Chain-of-Thought (CoT) prompting for LLMs to generate diverse and semantically rich captions is particularly noteworthy, as it enhances the quality of the generated annotations.
The experimental evaluation is robust, demonstrating clear superiority of models trained on ACAVCaps across various downstream tasks compared to other datasets. The use of comprehensive benchmarks like MECAT-Caption and the detailed analysis of generalization performance across multiple audio domains (speech, sound events, music) provide strong evidence of the dataset's effectiveness. The results are quantitatively supported by metrics that emphasize both descriptive specificity and semantic similarity, reinforcing the dataset's intended impact.
The paper provides sufficient implementation details regarding the training process, model architecture, and evaluation metrics. However, the reproducibility may be limited by the lack of access to the specific expert models used in the multi-expert pipeline, which are crucial for generating the dataset. The dataset itself is available, which aids in reproducibility, but the exact configurations and parameters for the LLM and expert models could be better documented.
One limitation is the potential bias introduced by the expert models used for audio analysis, which may not capture all nuances of audio content. Additionally, while the dataset is large and diverse, it may still miss certain rare or unique audio events that could be important for comprehensive audio understanding. The reliance on automated processes for generating captions might also lead to inconsistencies in quality across different audio samples.
The introduction of ACAVCaps has significant implications for the field of audio understanding and multimodal AI. By providing a rich, large-scale dataset, it enables the development of more capable audio-language models that can generalize better across various tasks. This can lead to advancements in applications such as automatic audio transcription, sound event detection, and even creative audio generation, ultimately enhancing the capabilities of AI systems in understanding and interacting with the auditory world.
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.
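The I2D loop described in the abstract, recursively feeding a model's output back as its own reference and aggregating objective scores across iterations, can be sketched schematically. Here `synthesize` and `score` are stand-ins for a real zero-shot TTS model and an objective metric such as UTMOSv2; the aggregation (a simple mean) is an assumption for illustration.

```python
# Schematic of the I2D (Iterate to Differentiate) idea: re-synthesize
# using the model's own output as the next reference, score each
# generation, and aggregate across iterations. `synthesize` and
# `score` are placeholders, not the paper's actual components.

def i2d_score(synthesize, score, text, reference, n_iters=3):
    """Returns the per-iteration scores and their mean. A robust model
    degrades more slowly under the induced distributional shift, so
    its aggregated score stays higher."""
    scores = []
    for _ in range(n_iters):
        audio = synthesize(text, reference)
        scores.append(score(audio))
        reference = audio  # recursive reference: output feeds next round
    return scores, sum(scores) / len(scores)
```

Two systems with near-identical single-pass scores can thus separate after a few iterations, which is the amplification effect the paper exploits.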
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Nanjing University
The main contribution of this paper is the introduction of the I2D framework, which enhances the reliability and discriminability of zero-shot TTS evaluations through an innovative iterative synthesis approach. This work addresses critical challenges in TTS evaluation, providing a robust methodology that can significantly impact the field by enabling more accurate assessments of model performance.
The proposed I2D framework introduces an innovative approach to TTS evaluation by leveraging iterative synthesis to amplify performance differences among models. This methodology addresses critical issues of score saturation in traditional evaluation metrics, providing a more reliable means of assessing TTS systems. The recursive use of synthesized outputs as references is a novel strategy that effectively reveals robustness differences among models, which is a significant advancement in the field.
The experiments conducted on 11 TTS models across multiple datasets are comprehensive and well-structured. The paper demonstrates a clear correlation between the proposed evaluation method and human judgments, significantly improving the reliability of automated TTS assessments. The use of both objective and subjective metrics strengthens the findings, although the paper could benefit from more detailed statistical analyses to further validate the results.
The paper provides sufficient details on the datasets, evaluation metrics, and experimental setup, which supports reproducibility. However, the lack of a publicly accessible code repository limits the ability for others to directly replicate the results. Including a project URL would enhance reproducibility.
The paper acknowledges higher computational costs associated with the I2D framework and its potential bias towards model stability over diversity. Additionally, the evaluation's reliance on reference audio quality may introduce conflicts in assessing naturalness versus speaker similarity, particularly in zero-shot settings.
The I2D framework has significant implications for the TTS community, offering a scalable and practical solution for evaluating model performance. By improving the discriminability of evaluation metrics, it can facilitate advancements in TTS technology, leading to better user experiences in applications such as virtual assistants, audiobooks, and more.
Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.
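The tiering step described above, ranking samples into easy, medium, and hard tiers from dominant sub-center cosine similarity against running batch statistics, can be sketched as below. The mean ± 0.5·std thresholds are an assumption for illustration, not the paper's exact rule.

```python
# Illustrative sketch of Curry-style difficulty tiering. Each sample's
# confidence is its max cosine similarity to any sub-center of its
# labeled speaker; thresholds here (mean +/- 0.5 * std of running
# batch statistics) are assumed, not taken from the paper.

def tier_samples(confidences, running_mean, running_std):
    """Returns a list of 'easy' / 'medium' / 'hard' labels."""
    hi = running_mean + 0.5 * running_std
    lo = running_mean - 0.5 * running_std
    tiers = []
    for c in confidences:
        if c >= hi:
            tiers.append("easy")    # confidently matched to its speaker
        elif c <= lo:
            tiers.append("hard")    # likely mislabeled or degraded
        else:
            tiers.append("medium")
    return tiers
```

A curriculum can then weight these tiers differently over training, emphasizing easy samples early for stable identity foundations and hard samples late for boundary sharpening.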
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of the CURriculum Ranking loss, which effectively addresses the challenges of large-scale speaker verification by dynamically adjusting the learning process based on sample difficulty. This innovative methodology, coupled with strong experimental results, positions the work as a significant advancement in the field of audio processing and speaker verification.
The proposed CURriculum Ranking (Curry) loss introduces an innovative approach to handling sample difficulty in speaker verification tasks. By utilizing Sub-center ArcFace for estimating sample difficulty and dynamically adjusting the learning process based on sample quality, the methodology stands out for its adaptability and lack of reliance on auxiliary annotations. This approach addresses a significant gap in existing loss functions that treat all samples uniformly, thereby enhancing the robustness of the model against noisy gradients from mislabeled or degraded samples.
The experiments conducted on large-scale datasets such as VoxCeleb1-O and SITW are rigorous and demonstrate substantial improvements in Equal Error Rate (EER) over the baseline. The reported reductions of 86.8% and 60.0% in EER are compelling, showcasing the effectiveness of the proposed method. However, further details on the experimental setup, including the specific configurations and hyperparameters used, would enhance the transparency and replicability of the results.
The paper lacks sufficient implementation details, such as code availability or specific configurations, which could hinder reproducibility. While the methodology is well-articulated, providing access to the code and datasets used would significantly bolster the paper's impact and allow other researchers to validate the findings.
One limitation is the potential overfitting to the specific datasets used for evaluation. The performance improvements may not generalize to other datasets or real-world scenarios without further validation. Additionally, the reliance on running batch statistics may introduce variability depending on batch sizes and compositions, which could affect the stability of the training process.
The proposed method has significant implications for speaker verification systems, particularly in real-world applications where data quality can vary widely. By improving robustness against mislabeled and degraded samples, this research could enhance the reliability of speaker verification in security, forensics, and personal assistant technologies. The approach could also inspire further research into adaptive loss functions across various machine learning domains.
Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.
Primary: University
All Institutions: University
YingMusic-Singer presents a significant advancement in controllable singing voice synthesis, offering a novel approach to lyric manipulation while maintaining melody fidelity. The technical contributions, particularly in methodology and evaluation, position this work as a valuable asset in the field of audio and music technology.
The methodology presented in YingMusic-Singer is innovative, leveraging a fully diffusion-based model that synthesizes singing voices from minimal input without requiring manual alignment. The use of curriculum learning and Group Relative Policy Optimization (GRPO) is particularly noteworthy as it addresses the trade-off between melody adherence and lyric fidelity. The architecture integrates a Variational Autoencoder, a Melody Extractor, and an IPA Tokenizer, which collectively enhance the model's ability to generate high-quality outputs. The introduction of LyricEditBench as a benchmark for evaluating lyric modification is a significant contribution, providing a structured framework for future research.
The experiments are thorough, comparing YingMusic-Singer against Vevo2 across multiple tasks and languages. The results demonstrate clear improvements in performance metrics such as Phoneme Error Rate (PER), melody adherence (F0-CORR), and subjective evaluations (N-MOS, M-MOS). The comprehensive evaluation across six editing types and two languages provides a robust validation of the model's capabilities, showcasing its strength in maintaining melody while allowing for lyric modifications.
The paper provides sufficient implementation details, including architecture specifications, training protocols, and datasets used. The authors have made their code, model weights, and benchmark publicly available, which enhances reproducibility. However, the complexity of the model and the specific configurations used may still pose challenges for replication without additional guidance.
One limitation noted is the potential for increased phoneme error rates when the model is tasked with generating significantly altered phoneme sequences while preserving melody. Additionally, while the model shows promise, its performance may vary with different singing techniques or languages not covered in the training data. The reliance on large-scale singing data also raises questions about the generalizability of the model to diverse vocal styles.
The implications of YingMusic-Singer are substantial, as it opens avenues for practical applications in music production, personalized music generation, and cross-lingual adaptations. The ability to modify lyrics while preserving melody could revolutionize how artists approach song covers and adaptations, making the technology accessible to a broader audience. Furthermore, the introduction of a benchmark for lyric editing could stimulate further research in the field of singing voice synthesis.
Multi-channel speech enhancement aims to recover clean speech from noisy multi-channel recordings. Most deep learning methods employ discriminative training, which can lead to non-linear distortions from regression-based objectives, especially under challenging environmental noise conditions. Inspired by ArrayDPS for unsupervised multi-channel source separation, we introduce ArrayDPS-Refine, a method designed to enhance the outputs of discriminative models using a clean speech diffusion prior. ArrayDPS-Refine is training-free, generative, and array-agnostic. It first estimates the noise spatial covariance matrix (SCM) from the enhanced speech produced by a discriminative model, then uses this estimated noise SCM for diffusion posterior sampling. This approach allows direct refinement of any discriminative model's output without retraining. Our results show that ArrayDPS-Refine consistently improves the performance of various discriminative models, including state-of-the-art waveform and STFT domain models. Audio demos are provided at https://xzwy.github.io/ArrayDPSRefineDemo/.
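The first stage described above, estimating the noise SCM from a discriminative model's output, can be sketched in a few lines. The function name, STFT shapes, and the residual-based noise estimate below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def estimate_noise_scm(noisy_stft, enhanced_stft):
    """Estimate the noise spatial covariance matrix per frequency bin.

    noisy_stft:    (channels, freq, frames) complex STFT of the mixture
    enhanced_stft: (channels, freq, frames) complex STFT of the
                   discriminative model's enhanced output
    Returns: (freq, channels, channels) Hermitian SCM estimates.
    """
    # The residual after subtracting the enhanced speech approximates the noise.
    noise = noisy_stft - enhanced_stft
    channels, freqs, frames = noise.shape
    scm = np.zeros((freqs, channels, channels), dtype=complex)
    for f in range(freqs):
        n_f = noise[:, f, :]                  # (channels, frames)
        scm[f] = (n_f @ n_f.conj().T) / frames  # time-averaged outer product
    return scm

rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 8, 100)) + 1j * rng.standard_normal((4, 8, 100))
enhanced = 0.5 * noisy  # stand-in for a discriminative model's output
scm = estimate_noise_scm(noisy, enhanced)
assert scm.shape == (8, 4, 4)
assert np.allclose(scm, scm.conj().transpose(0, 2, 1))  # Hermitian per bin
```

The estimated SCM would then condition the diffusion posterior sampling step; that part depends on the specific diffusion prior and is omitted here.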
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ArrayDPS-Refine, a novel generative refinement method for multi-channel speech enhancement that significantly improves the performance of existing discriminative models without retraining. This work represents a meaningful advancement in the field of audio processing, particularly in enhancing speech intelligibility and quality in challenging acoustic environments.
The methodology presented in the paper is innovative, introducing ArrayDPS-Refine as a generative refinement technique that enhances outputs from discriminative models without the need for retraining. This is achieved through the estimation of the noise spatial covariance matrix (SCM) and the application of diffusion posterior sampling. The approach is well-structured, leveraging existing techniques in multi-channel speech enhancement while addressing the limitations of previous methods. The training-free aspect is particularly noteworthy, as it allows for flexibility across different models and configurations.
The experimental evaluation is comprehensive, demonstrating the effectiveness of ArrayDPS-Refine across various discriminative models, including state-of-the-art techniques. The use of multiple metrics such as STOI, eSTOI, PESQ, SI-SDR, and WER provides a robust framework for assessing performance improvements. The results indicate significant enhancements in intelligibility and perceptual quality, validating the proposed method's effectiveness. However, the paper could benefit from more detailed comparisons with baseline models to further contextualize the improvements.
The paper provides a detailed account of the experimental setup, including configurations for the diffusion model and the datasets used. However, the lack of a publicly available code repository limits reproducibility. Future work should consider releasing the code and models to facilitate validation of results by the community.
One limitation of the proposed method is its reliance on the quality of the initial discriminative model outputs. If the initial outputs are significantly distorted, the refinement process may not yield optimal results. Additionally, the method's performance in highly complex noise environments or with multiple overlapping speakers remains to be fully explored.
The implications of this work are significant for applications in speech recognition, telecommunications, and assistive technologies. By improving speech enhancement techniques, the proposed method could enhance communication in noisy environments, benefiting users in various real-world scenarios. The training-free nature of the method also suggests potential for broader adoption across different devices and applications.
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, most existing methods apply supervision only at the final classification layer, limiting the discriminative power of intermediate representations. In this work, we propose Crab (Contrastive Representation and Multimodal Aligned Bottleneck), a bimodal Cross-Modal Transformer architecture that integrates speech representations from WavLM and textual representations from RoBERTa, together with a novel Multi Layer Contrastive Supervision (MLCS) strategy. MLCS injects multi-positive contrastive learning signals at multiple layers of the network, encouraging emotionally discriminative representations throughout the model without introducing additional parameters at inference time. To further address data imbalance, we adopt weighted cross-entropy during training. We evaluate the proposed approach on three benchmark datasets covering different degrees of emotional naturalness: IEMOCAP, MELD, and MSP-Podcast 2.0. Experimental results demonstrate that Crab consistently outperforms strong unimodal and multimodal baselines across all datasets, with particularly large gains under naturalistic and highly imbalanced conditions. These findings highlight the effectiveness of Multi Layer Contrastive Supervision as a general and robust strategy for SER. The official implementation can be found at https://github.com/AI-Unicamp/Crab.
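The multi-positive contrastive signal MLCS injects at each layer can be sketched as a supervised (SupCon-style) loss; this is a minimal single-layer illustration under that assumption, and the paper's exact formulation may differ:

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, labels, temperature=0.1):
    """Multi-positive contrastive loss for one intermediate layer.

    embeddings: (batch, dim) layer activations; L2-normalized below
    labels:     (batch,) emotion class ids; same-class samples act as positives
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # never contrast a sample with itself
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # Mean log-probability over each anchor's positives.
    per_anchor = np.where(pos, log_prob, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return -per_anchor.mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))
y = np.array([0, 0, 1, 1, 2, 2])
loss = multi_positive_contrastive_loss(x, y)
assert np.isfinite(loss) and loss >= 0.0
```

In the full model this term would be summed over the supervised layers alongside the weighted cross-entropy objective; the auxiliary heads can be dropped at inference, which is why no extra parameters remain.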
Primary: Universidade Estadual de Campinas (UNICAMP)
All Institutions: Universidade Estadual de Campinas (UNICAMP), MCTI, CAPES, FAPESP
The paper presents Crab, a multimodal SER framework that effectively integrates speech and text representations through a novel contrastive learning strategy, achieving significant performance improvements in emotion recognition tasks. The innovative approach and rigorous evaluation contribute meaningfully to the field of speech emotion recognition, addressing key challenges in real-world applications.
The proposed methodology introduces a novel Cross-Modal Transformer architecture, integrating speech and text representations while employing Multi Layer Contrastive Supervision (MLCS) to enhance emotion recognition. This approach is innovative as it applies contrastive learning at multiple layers, which is not common in existing SER frameworks. The use of weighted cross-entropy to address class imbalance further strengthens the methodology, making it robust for real-world applications.
The experimental evaluation is comprehensive, utilizing three benchmark datasets (IEMOCAP, MELD, and MSP-Podcast 2.0) that vary in emotional naturalness. The results consistently demonstrate superior performance of the Crab model compared to strong baselines, particularly in challenging naturalistic scenarios with class imbalance. The use of multiple evaluation metrics (UAR and WAR) provides a well-rounded assessment of model performance.
The paper includes a link to the official implementation on GitHub, which is crucial for reproducibility. However, specific implementation details such as hyperparameters and training configurations could be more explicitly stated to facilitate easier replication by other researchers.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of emotional expressions in real-world scenarios. Additionally, while the model shows robustness to class imbalance, the performance on unseen speakers in naturalistic conditions could be further explored.
The findings have significant implications for applications in human-computer interaction, customer service, and online education, where understanding emotional cues can enhance user experience. The proposed model's ability to handle class imbalance makes it particularly valuable for deploying SER systems in real-world contexts.
We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes content generated by ten popular AI music generation systems. To prevent shortcut learning and promote robust generalization, the dataset is deliberately constructed to be challenging, enforcing semantic-level alignment between spoofed audio and bona fide references. This alignment is achieved by conditioning generated audio samples directly on bona-fide waveforms or song descriptors. We evaluate Echoes in a cross-dataset setting against three existing AI-generated music datasets using state-of-the-art Wav2Vec2 XLS-R 2B representations. Results show that (i) Echoes is the hardest in-domain dataset; (ii) detectors trained on existing datasets transfer poorly to Echoes; (iii) training on Echoes yields the strongest generalization performance. These findings suggest that provider diversity and semantic alignment help learn more transferable detection cues.
Primary: National University of Science and Technology POLITEHNICA Bucharest
All Institutions: Fraunhofer AISEC, National University of Science and Technology POLITEHNICA Bucharest
The paper presents Echoes, a semantically-aligned dataset for AI-generated music detection, which significantly enhances the benchmarking landscape for music deepfake detection by addressing key challenges in data diversity and shortcut learning.
The methodology is robust, focusing on generating a diverse dataset that emphasizes semantic alignment between real and AI-generated music. The use of LLMs to derive song descriptors for conditioning the generation process is innovative and addresses the challenges of shortcut learning in deepfake detection. The dataset's design, including the variety of music genres and the inclusion of multiple AI music generation systems, enhances its applicability and relevance.
The experimental evaluation is thorough, demonstrating the dataset's effectiveness through cross-dataset testing. The results highlight the difficulty of the Echoes dataset and its ability to promote generalization in detection models. The use of Wav2Vec2 XLS-R 2B for feature extraction and the evaluation metrics employed (EER) are appropriate for the task at hand.
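For reference, the EER used here is the operating point where the false-accept and false-reject rates coincide; a simple threshold-sweep implementation (not tied to the paper's tooling) is:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Compute EER: the threshold where false-accept and false-reject rates cross.

    scores: detection scores, higher = more likely bona fide
    labels: 1 for bona fide, 0 for spoofed / AI-generated
    """
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # spoofs accepted
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Perfectly separable scores give an EER of zero.
eer = equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
assert eer == 0.0
```

Production evaluations typically interpolate between thresholds (e.g. via an ROC curve) rather than sweeping raw score values, but the crossing-point logic is the same.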
The paper provides sufficient details about the dataset generation process, including the selection of bona fide tracks and the conditioning methods used for AI-generated samples. However, the lack of a direct link to the code or model training details limits full reproducibility.
One limitation is the reliance on a specific set of AI music generation systems, which may not encompass the full spectrum of current technologies. Additionally, the dataset may not cover all possible music genres or styles, potentially limiting its generalizability. The paper also mentions that future work will explore more complex scenarios, indicating that the current evaluation may not fully capture real-world conditions.
The dataset has significant implications for the music industry, particularly in addressing the challenges posed by AI-generated music. By providing a benchmark for deepfake detection, it can help improve the integrity of music platforms and support the development of more reliable detection systems. This work also opens avenues for further research in audio forensics and the ethical implications of AI in creative fields.
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSRHuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSRHuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT carry over directly.
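The core constraint can be illustrated with the frame-rate arithmetic it relies on: each rate-specific convolutional branch must realize a total stride that maps its input rate onto one shared output frame rate. The 50 Hz target below matches HuBERT's 20 ms frames; the helper itself is only an illustrative sketch:

```python
def total_stride(sample_rate_hz, frame_rate_hz=50):
    """Total downsampling factor a rate-specific CNN front end must realize
    so that waveforms at any input rate land on the same output frame rate."""
    assert sample_rate_hz % frame_rate_hz == 0, "rate must divide evenly"
    return sample_rate_hz // frame_rate_hz

# HuBERT's 16 kHz front end (conv strides 5,2,2,2,2,2,2) realizes a total
# stride of 320, i.e. 20 ms frames at 50 Hz; higher input rates need
# proportionally larger total strides to hit the same temporal grid.
assert total_stride(16000) == 320
assert total_stride(24000) == 480
assert total_stride(48000) == 960
```

Because every branch emits the same frame rate, the shared Transformer encoder and mask-prediction objective need no changes, which is what lets HuBERT-era analyses carry over.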
Primary: Tianjin University
All Institutions: Tianjin University, Chinese Academy of Sciences, Huiyan Technology Company, Tianjin Key Laboratory of Cognitive Computing and Application
This paper presents MSRHuBERT, a self-supervised learning framework that effectively addresses the resolution mismatch problem in speech processing across multiple sampling rates. The technical contributions are substantial, with a well-justified methodology and promising experimental results, positioning this work as a notable advancement in the field of audio machine learning.
The proposed MSRHuBERT method introduces a novel multi-sampling-rate adaptive downsampling CNN that effectively addresses the resolution mismatch problem in self-supervised speech learning. By allowing the model to process audio at various sampling rates without resampling, it preserves high-frequency information critical for tasks like speech reconstruction while maintaining low-frequency semantic content for ASR. The methodology is well-structured, retaining the core elements of the HuBERT framework, which facilitates the integration of existing improvements and analyses. The approach is theoretically sound and presents a clear advancement over existing methods.
The experiments conducted span multiple sampling rates (16 kHz to 48 kHz) and evaluate both ASR and full-band speech reconstruction tasks. The results demonstrate that MSRHuBERT outperforms the baseline HuBERT model across various metrics, showcasing its effectiveness in preserving high-frequency details while maintaining low-frequency content. The use of diverse datasets and the systematic evaluation of performance across different sampling rates strengthens the findings. However, the paper could benefit from additional comparative analyses with other state-of-the-art models beyond HuBERT.
The paper provides a detailed description of the experimental setup, including the datasets used and the training configurations. However, the absence of a publicly available code repository or demo URL limits reproducibility. Future work should consider releasing the model and code to facilitate validation by the research community.
One limitation is the reliance on the HuBERT architecture, which may restrict the generalizability of the proposed method to other architectures. Additionally, while the paper addresses the resolution mismatch problem, it does not explore the implications of using the model in real-world applications where sampling rates may vary dynamically. The paper could also expand on potential computational costs associated with the multi-sampling-rate adaptive downsampling CNN.
The implications of this research are significant for the field of speech processing, particularly in applications requiring robust performance across varying audio qualities and sampling rates. The ability to handle mixed-rate data without loss of information can enhance the usability of speech models in diverse environments, potentially leading to improvements in voice recognition systems, virtual assistants, and other audio applications.
This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable understanding of complex acoustic scenes, their performance depends on the semantic richness of the underlying audio encoder representations. This challenge addresses the integration gap by providing a unified generative evaluation framework, XARES-LLM, which assesses submitted encoders across a diverse suite of downstream classification and generation tasks. By decoupling encoder development from LLM fine-tuning, the challenge establishes a standardized protocol for general-purpose audio representations that can effectively be used for the next generation of multimodal language models.
Primary: University of Surrey
All Institutions: University of Surrey, DataOcean AI Inc
The paper presents a novel benchmark for evaluating audio encoders in the context of LALMs, contributing to the advancement of multimodal machine learning. The introduction of XARES-LLM and the structured challenge framework represent significant steps forward in the evaluation of audio representations, with implications for future research and applications in audio understanding.
The paper introduces the Interspeech 2026 Audio Encoder Capability Challenge, which is a well-structured benchmark for evaluating audio encoders in the context of Large Audio Language Models (LALMs). The proposed methodology, XARES-LLM, effectively decouples encoder development from LLM fine-tuning, allowing for a more focused evaluation of audio representations. The challenge's design, which includes multiple tracks and a unified generative evaluation framework, is innovative and addresses existing gaps in the evaluation of audio encoders. The use of a single decoder model for diverse tasks is a significant methodological advancement.
The experiments conducted across four tracks provide a comprehensive assessment of the performance of various audio encoders. The inclusion of both public and hidden test sets enhances the robustness of the evaluation. The results indicate a clear performance advantage for encoders that leverage LALM alignment, showcasing the effectiveness of the proposed evaluation framework. The leaderboard results are well-documented, providing insights into the strengths of different approaches.
The paper emphasizes reproducibility by detailing the experimental setup, including the use of fixed random seeds and multiple hardware configurations. However, the absence of a publicly accessible code repository limits the ability for external validation of results.
One notable limitation is the reliance on proprietary audio encoders, which may restrict the generalizability of findings to publicly available models. Additionally, while the challenge addresses various tasks, the focus on generative outputs may not fully encompass all aspects of audio understanding.
The challenge has the potential to significantly advance the field of audio processing by establishing a standardized protocol for evaluating audio encoders. This could lead to improved performance in multimodal language models and broader applications in areas such as speech recognition, emotion detection, and audio classification.
Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, we 1) use a simple prompt ensemble and 2) propose a novel technique called prompt amplification, which repeats audio and text queries to elicit stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.
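The late-fusion idea, combining prompt-ensembled zero-shot ALM probabilities with a specialist's output, can be sketched as below. The function name, the equal-weight prompt averaging, and the `alpha` fusion weight are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def zs_fuse(alm_logits_per_prompt, fm_logits, alpha=0.5):
    """Late-fuse zero-shot ALM estimates with a specialist model's output.

    alm_logits_per_prompt: (n_prompts, n_classes) audio-text similarity
        logits, one row per prompt template in the ensemble
    fm_logits: (n_classes,) specialist foundation-model logits
    alpha: fusion weight between the two sources (assumed value)
    """
    # Prompt ensemble: average class probabilities across prompt templates.
    alm_probs = softmax(np.asarray(alm_logits_per_prompt), axis=-1).mean(axis=0)
    fm_probs = softmax(np.asarray(fm_logits))
    return alpha * alm_probs + (1 - alpha) * fm_probs

fused = zs_fuse([[2.0, 0.1, 0.1], [1.5, 0.3, 0.2]], [0.2, 1.8, 0.1])
assert fused.shape == (3,)
assert np.isclose(fused.sum(), 1.0)  # still a valid distribution
```

Prompt amplification would additionally repeat the audio and text queries before encoding; that step depends on the specific dual-encoder ALM and is not shown here.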
Primary: Emory University
All Institutions: Emory University
The main contribution of this paper is the introduction of ZS-Fuse, a novel late-fusion method that combines zero-shot predictions from Audio-Language Models with specialist Foundation Models to improve Speech Emotion Recognition performance. This work showcases a promising direction in leveraging multimodal learning for enhanced emotion recognition, addressing both practical applications and theoretical advancements in the field.
The paper introduces ZS-Fuse, a late-fusion method that effectively combines zero-shot emotion estimates from dual-encoder Audio-Language Models (ALMs) with domain-specialist Foundation Models (FMs). The methodology is well-structured, employing prompt amplification and a simple prompt ensemble to enhance the zero-shot capabilities of ALMs. The choice of dual-encoder models is justified, and the approach to handle ambiguity in emotions is innovative, though the simplicity of the prompt engineering could be seen as a limitation in exploring more complex interactions.
The experiments are comprehensive, evaluating the proposed method across three datasets (RAVDESS, MSP-Podcast, and IEMOCAP) and multiple ALM and FM combinations. The results demonstrate significant improvements over state-of-the-art baselines, particularly with the CLSP model. However, the paper could benefit from more detailed statistical analysis and discussion of the results, such as confidence intervals or significance testing.
The paper provides sufficient details regarding the training process, including the choice of optimizers, batch sizes, and the number of epochs. However, the lack of a public repository or demo URL limits the reproducibility of the results, as external researchers cannot easily validate the findings or replicate the experiments.
One major limitation is the reliance on prompt amplification, which can lead to unpredictable performance, as indicated by the results showing that some configurations degrade performance. Additionally, the paper does not explore the implications of using larger or more complex prompt ensembles, which could enhance the results further.
The proposed method has significant implications for the development of emotion-aware systems, such as empathetic virtual assistants and customer service applications. The integration of ALMs with FMs could lead to advancements in various fields, including mental health monitoring and interactive dialogue systems.
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.
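The depth-aware partial aggregation step can be sketched as follows: each layer is averaged only over the clients that actually trained it, since early-exiting clients never update the deeper layers. Scalar weights and the function name are simplifications for illustration:

```python
def depthwise_aggregate(client_updates):
    """Layer-wise partial aggregation for clients exiting at different depths.

    client_updates: list of dicts {layer_name: weight_value}; a client that
    exits early simply omits the layers it never trained.
    """
    layers = {name for update in client_updates for name in update}
    return {
        name: sum(u[name] for u in client_updates if name in u)
        / sum(1 for u in client_updates if name in u)
        for name in layers
    }

# Two clients exit after layer l1; one trains the full depth through l2.
updates = [{"l0": 1.0, "l1": 2.0},
           {"l0": 3.0, "l1": 4.0},
           {"l0": 5.0, "l1": 6.0, "l2": 7.0}]
agg = depthwise_aggregate(updates)
assert agg == {"l0": 3.0, "l1": 4.0, "l2": 7.0}
```

In a real FedAvg-style round the values would be weight tensors and the average would typically be weighted by client dataset size; the per-layer participation logic stays the same.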
Primary: University of Cambridge
All Institutions: University of Cambridge, Electronic Information School, Flower Labs, University of Auckland, University of Melbourne, Wuhan University
This paper presents a novel adaptive federated fine-tuning framework that effectively addresses the challenges of heterogeneous environments in self-supervised speech representation learning. The technical contributions, particularly in the areas of early exits and layer-wise aggregation, represent a meaningful advancement in the field of federated learning for audio applications.
The proposed adaptive federated fine-tuning framework introduces innovative mechanisms such as early exits and layer-wise partial aggregation, which effectively address the challenges posed by heterogeneity in federated learning environments. The methodology is well-structured, leveraging an elastic multi-branch architecture that allows clients to dynamically select their training depth based on local resources and task complexity. This approach not only enhances computational efficiency but also maintains performance across diverse speech tasks. The integration of lightweight prediction heads and depth-aware aggregation strategies is a significant advancement in federated learning for speech applications.
The experiments are comprehensive, covering five diverse downstream tasks that span various aspects of speech understanding. The results demonstrate the effectiveness of the proposed framework in reducing computational overhead while achieving competitive performance compared to centralized training. The evaluation metrics used, including word error rates and classification error rates, are appropriate for the tasks at hand. However, the paper could benefit from additional comparisons with existing state-of-the-art methods to further contextualize the results.
The paper provides a detailed description of the experimental setup, including datasets, model architectures, and training configurations, which aids reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the experiments. Including a link to the implementation would significantly enhance reproducibility.
One limitation is the reliance on a specific backbone model (Wav2Vec 2.0), which may not generalize to all speech tasks or architectures. Additionally, while the framework addresses resource constraints, it does not fully explore the implications of data heterogeneity beyond the basic partitioning strategy employed. The paper could also discuss potential trade-offs between performance and computational efficiency in more detail.
The proposed framework has significant implications for deploying speech recognition systems in privacy-sensitive environments, such as mobile devices and personal assistants. By enabling efficient fine-tuning without compromising user data privacy, this work contributes to the growing field of privacy-preserving machine learning. The methodology could be adapted to other domains where federated learning is applicable, potentially influencing future research in decentralized learning systems.
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from a variational auto-encoder (VAE). We validate our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve parameter-efficient, high-performance training that keeps DiT-Flow robust to multiple distortions, using only 4.9% of the total parameters while achieving better performance on five unseen distortions.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Technion Israel Institute of Technology, University of Haifa
The main contribution of this paper is the development of DiT-Flow, a novel speech enhancement framework that effectively utilizes flow matching and latent representations to improve robustness against multiple distortions. This work represents a significant step forward in the field of audio processing, addressing common challenges faced in real-world applications and demonstrating the potential for future advancements in speech enhancement technologies.
The methodology of DiT-Flow is robust, leveraging flow matching and latent Diffusion Transformers to enhance speech under multiple distortions. The integration of LoRA with the Mixture-of-Experts framework is particularly innovative, allowing for parameter-efficient adaptation to varying acoustic conditions. The use of a synthetic dataset, StillSonicSet, designed to simulate realistic conditions, further strengthens the approach. However, the paper could benefit from clearer descriptions of hyperparameter choices and training procedures.
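To make the flow-matching core concrete: conditional flow matching regresses a network onto the velocity of a straight-line probability path between noise and data. The following is a minimal numpy sketch of how one training example is formed, not the authors' DiT-based implementation; the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_example(x1, rng):
    """Sample one conditional flow-matching training example.

    Straight-line (rectified-flow) path: x_t = (1 - t) * x0 + t * x1,
    so the regression target for the velocity field is v = x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1         # point on the path at time t
    v_target = x1 - x0                   # constant velocity along the line
    return x0, t, xt, v_target

# Stand-in for a clean VAE latent; the model would then be trained to
# minimise || v_theta(xt, t, cond) - v_target ||^2.
x1 = rng.standard_normal(8)
x0, t, xt, v = flow_matching_example(x1, rng)
```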
The experiments are comprehensive, validating DiT-Flow against state-of-the-art models across various conditions. The use of multiple evaluation metrics, including PESQ, ESTOI, and DNSMOS, provides a well-rounded assessment of performance. The results demonstrate significant improvements over baseline models, particularly in challenging scenarios, indicating the effectiveness of the proposed methods. However, the paper lacks detailed comparisons with a broader range of existing methods, which could provide more context for its contributions.
The paper includes sufficient detail regarding the model architecture and training process, but lacks a clear link to code or datasets, which hampers reproducibility. Providing access to the StillSonicSet dataset and the trained models would enhance reproducibility and facilitate further research.
One limitation is the reliance on synthetic data, which may not fully capture the complexities of real-world audio environments. Additionally, while the model shows robustness to multiple distortions, its performance in extreme or novel conditions remains to be tested. The computational efficiency of the model in real-time applications also needs further exploration.
The advancements in speech enhancement presented in this paper have significant implications for real-world applications, particularly in telecommunication, virtual meetings, and assistive technologies. The ability to enhance speech quality in diverse acoustic environments can improve communication clarity and accessibility for users in various settings.
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
Primary: Institute of Science Tokyo
All Institutions: Institute of Science Tokyo, The University of Osaka, The University of Tokyo
The main contribution of this paper is the introduction of AnimalCLAP, a taxonomy-aware language-audio pretraining framework that significantly improves species recognition and trait inference from animal vocalizations. This work represents a meaningful advancement in the application of machine learning to ecological monitoring, with a robust methodology and promising results that could influence future research and practices in wildlife assessment.
The methodology presented in AnimalCLAP is innovative, leveraging a taxonomy-aware framework that integrates hierarchical biological information into the model's training process. The authors introduce a substantial dataset of animal vocalizations, which is a critical asset for training and evaluating the model. The alignment of audio and textual representations through taxonomic structures is a novel approach that enhances the model's ability to generalize to unseen species, which is a significant challenge in the field. The use of contrastive learning techniques is well-justified and effectively applied to the task of species recognition and trait inference.
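For concreteness, the CLAP-style contrastive objective underlying this alignment can be sketched as a symmetric InfoNCE loss over paired audio and text embeddings. This numpy sketch shows only the base objective; the paper's taxonomy-aware contribution (hierarchical text prompts and taxonomic structure) is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature          # cosine-similarity logits
    n = len(logits)                           # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilise the softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Audio-to-text and text-to-audio directions, averaged.
    return 0.5 * (xent(logits) + xent(logits.T))

aligned = clap_contrastive_loss(np.eye(4), np.eye(4))            # near zero
shuffled = clap_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Perfectly matched pairs drive the loss toward zero, while mismatched pairs are penalised heavily, which is the pressure that pulls vocalizations toward their taxonomic text descriptions.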
The experiments are comprehensive, utilizing a large dataset of 4,225 hours of recordings from 6,823 species, which is a considerable contribution to the field. The results demonstrate that AnimalCLAP outperforms existing models, including CLAP, in recognizing unseen species and inferring ecological traits. The evaluation metrics used are appropriate, and the authors provide a clear comparison of their model's performance against baseline methods, showcasing the effectiveness of their approach.
The authors commit to making their dataset, code, and models publicly available, which is crucial for reproducibility. However, the paper would benefit from a more detailed description of the experimental setup, including hyperparameter settings and training procedures, to facilitate replication by other researchers.
One limitation of the study is the potential bias in the dataset, which may not cover all ecological contexts or species diversity adequately. Additionally, the model's performance on edge cases or species with very similar vocalizations may not be thoroughly addressed. The reliance on taxonomic structures may also limit the model's applicability in more complex ecological scenarios where such hierarchies are not well defined.
The implications of this research are significant for wildlife conservation and ecological monitoring, as it provides a tool for non-invasive species identification and trait inference from vocalizations. This could enhance biodiversity assessments and inform conservation strategies. The methodology could also be adapted for other domains where audio classification is relevant, such as environmental monitoring or even human-related vocalizations.
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology
The main contribution of this paper is the introduction of a novel weakly supervised framework for audio-visual semantic segmentation that effectively aligns audio and visual features without requiring dense annotations. This work represents a significant step forward in the field of audio-visual understanding, providing a robust methodology and promising results that could influence future research and applications.
The methodology presented in this paper is innovative, particularly in its decomposition of the WSAVSS task into three distinct phases: looking, listening, and segmentation. The introduction of Temporal Visual Prompting (TVP) to enhance audio understanding through visual cues is a novel approach that leverages the inherent relationships between audio and visual modalities. The Progressive Cross-modal Alignment for Semantics (PCAS) framework, which combines instance-wise and token-wise contrastive learning, is well-conceived and addresses the challenge of aligning audio and visual features without requiring dense annotations. This progressive alignment strategy is a significant advancement over existing methods, making it a valuable contribution to the field.
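The end product of the token-wise alignment can be pictured as a similarity map between an audio category embedding and per-region visual tokens. The following numpy sketch is the reviewer's illustration of that scoring step, not the PCAS implementation; names and the threshold are hypothetical.

```python
import numpy as np

def region_scores(audio_class_emb, visual_tokens):
    """Score each image region against an audio category embedding.

    visual_tokens: (H*W, D) patch features; audio_class_emb: (D,).
    Returns cosine similarities in [-1, 1], one per region -- the signal a
    token-wise contrastive objective sharpens into a segmentation mask.
    """
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    a = audio_class_emb / np.linalg.norm(audio_class_emb)
    return v @ a

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
scores = region_scores(np.array([1.0, 0.0]), tokens)
mask = scores > 0.5    # crude pseudo-mask from a similarity threshold
```

Regions whose tokens point in the same direction as the audio category light up; a trained system replaces the fixed threshold with learned, progressively aligned representations.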
The experiments are comprehensive, demonstrating the effectiveness of the proposed method through comparisons with both weakly supervised and fully supervised baselines. The use of multiple datasets and the reporting of mean IoU and F-score metrics provide a robust evaluation of the model's performance. The ablation studies effectively highlight the contributions of each module within the proposed framework, reinforcing the claims of improved performance. However, the absence of a demo or project URL limits the accessibility of the results for further validation by the community.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or dataset access instructions. The absence of these resources may hinder reproducibility. Clearer guidelines on how to replicate the experiments would enhance the paper's impact.
One limitation of the study is the reliance on video-level labels, which, while reducing annotation costs, may not capture the full complexity of audio-visual interactions. Additionally, the paper does not address potential biases in the datasets used, which could affect the generalizability of the results. The performance on more complex scenes with overlapping sounds and visuals could also be explored further.
The proposed WSAVSS framework has significant implications for applications in multimedia content analysis, human-computer interaction, and assistive technologies. By reducing the need for extensive annotations, this research can facilitate advancements in real-time audio-visual processing systems, enhancing accessibility and user experience in various domains. The approach could also inspire further research into weakly supervised learning paradigms across different modalities.
This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglement strategy utilizing Gradient Reversal Layers (GRL) combined with cosine similarity loss to decouple speaker and emotion information. We introduce Multi Positive Contrastive Learning (MPCL) to induce clustered representations of speaker and emotion embeddings based on their respective labels. Furthermore, SelfTTS employs a self-refinement strategy via Self-Augmentation, exploiting the model's voice conversion capabilities to enhance the naturalness of synthesized speech. Experimental results demonstrate that SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.
Primary: Universidade Estadual de Campinas (UNICAMP)
All Institutions: Universidade Estadual de Campinas (UNICAMP)
The main contribution of this paper is the development of SelfTTS, a robust TTS framework that achieves high-quality cross-speaker style transfer through innovative embedding disentanglement and self-refinement strategies. This work represents a meaningful advancement in the field of speech synthesis, addressing key challenges related to emotional expressivity and speaker identity while providing a solid experimental foundation to support its claims.
The paper introduces SelfTTS, a novel TTS framework that effectively decouples speaker and emotion embeddings without relying on external encoders. The methodology employs Gradient Reversal Layers (GRL) and Multi Positive Contrastive Learning (MPCL) to achieve disentanglement and clustering of embeddings, which is a significant advancement over existing methods that often suffer from speaker leakage. The self-refinement strategy through Self-Augmentation is particularly innovative, leveraging the model's voice conversion capabilities to enhance the naturalness of synthesized speech. This approach is well-justified and clearly articulated, demonstrating a solid understanding of the challenges in TTS.
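The Gradient Reversal Layer at the heart of the disentanglement is simple to state: identity in the forward pass, sign-flipped (and scaled) gradient in the backward pass. A minimal sketch of those semantics, outside any autograd framework and with hypothetical names:

```python
import numpy as np

class GradReverse:
    """Gradient Reversal Layer: identity forward, negated gradient backward.

    Placed between a shared encoder and an adversarial speaker classifier,
    it makes the encoder *maximise* the classifier's loss, scrubbing
    speaker information out of the emotion embedding.
    """
    def __init__(self, lambd=1.0):
        self.lambd = lambd

    def forward(self, x):
        return x                            # no-op in the forward pass

    def backward(self, grad_output):
        return -self.lambd * grad_output    # flip and scale the gradient

grl = GradReverse(lambd=0.5)
x = np.array([1.0, -2.0])
y = grl.forward(x)                          # unchanged activations
g = grl.backward(np.array([0.2, 0.4]))      # reversed classifier gradient
```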
The experimental setup is robust, utilizing both subjective (eMOS, nMOS, sMOS) and objective metrics (UTMOS, WER, SECS, EECS) to evaluate performance. The results indicate that SelfTTS outperforms state-of-the-art models in emotional naturalness and stability, which is a crucial aspect of TTS systems. The use of cross-corpus experiments adds to the credibility of the findings, although the paper could benefit from more extensive comparisons with additional baselines.
The paper provides adequate implementation details, including the architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The authors have made their code publicly available, enhancing the likelihood that other researchers can replicate the results. However, some hyperparameters and specific configurations could be more explicitly detailed to ensure complete clarity.
One limitation noted is the model's performance in cross-corpus scenarios, where emotional adherence is lower due to the differences in recording conditions. Additionally, while the Self-Augmentation strategy shows promise, its effectiveness may vary based on the quality of synthetic samples generated, which could introduce artifacts into the training process.
The advancements presented in SelfTTS have significant implications for the development of expressive TTS systems, particularly in applications requiring emotional expressivity and speaker identity preservation. This work could benefit various fields, including virtual assistants, audiobooks, and gaming, where natural and emotionally engaging speech synthesis is essential.
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find bias is strongly language-dependent, and multimodal fusion does not reliably improve fairness. To address these, we propose ERM-MinMaxGAP, a fairness-informed training objective, which augments empirical risk minimization (ERM) with a proposed adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
Primary: Kyoto University
All Institutions: Kyoto University, A*STAR (Agency for Science, Technology and Research)
This paper presents a novel benchmark and a fairness-aware training objective for mitigating gender bias in multilingual multimodal speech emotion recognition systems. The technical contributions and methodology are robust, addressing a pressing issue in the field of machine learning and AI.
The proposed methodology, ERM-MinMaxGAP, is a significant advancement in addressing gender bias in multilingual multimodal speech emotion recognition (SER). The integration of empirical risk minimization with a fairness regularization term that focuses on the maximum male-female loss gap is innovative. The adaptive fairness weight mechanism further enhances the robustness of the training process, allowing for dynamic adjustments based on the model's performance. The detailed description of the MinMaxGAP regularizer and its implementation demonstrates a thorough understanding of the complexities involved in SER tasks, particularly in a multilingual context.
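The objective's structure can be sketched directly from this description: empirical risk plus a penalty on the worst male-female loss gap over (language, modality) groups. This numpy sketch uses a fixed scalar in place of the paper's adaptive fairness weight, and all names are the reviewer's, not the authors':

```python
import numpy as np

def erm_minmaxgap(group_losses, fairness_weight):
    """ERM plus a penalty on the worst male-female loss gap.

    group_losses: dict mapping (language, modality) -> (male_loss, female_loss).
    The regularizer targets the *maximum* absolute gap over all groups, so
    training pressure concentrates on the most biased language/modality.
    """
    all_losses = [l for pair in group_losses.values() for l in pair]
    erm = float(np.mean(all_losses))
    max_gap = max(abs(m - f) for m, f in group_losses.values())
    return erm + fairness_weight * max_gap

losses = {("en", "audio"): (0.9, 0.5), ("ja", "audio"): (0.6, 0.7)}
obj = erm_minmaxgap(losses, fairness_weight=2.0)
```

Here the English gap (0.4) dominates the Japanese one (0.1), so only the worst group contributes to the penalty, which is what distinguishes a min-max regularizer from an average-gap one.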
The experimental setup is well-structured, utilizing the MELD-ST dataset to benchmark the proposed method against existing models. The results indicate that ERM-MinMaxGAP not only improves SER performance but also reduces gender disparity effectively across different languages and modalities. The ablation studies provide valuable insights into the contributions of each component of the proposed method, reinforcing the effectiveness of the MinMaxGAP regularization approach.
The paper states that all code, data, and models will be released upon acceptance, which is a positive aspect for reproducibility. Additionally, specific implementation details regarding the training process, hyperparameters, and dataset preparation are provided, which aids replication of the experiments. The clarity in methodology and results presentation supports reproducibility.
One limitation is that while the proposed method shows improvements in SER and fairness, it does not achieve the minimum post-hoc gender gap in every setting, indicating that the approach may not be universally applicable across all datasets or languages. Additionally, the reliance on a specific dataset (MELD-ST) may limit the generalizability of the findings.
The implications of this research are significant, as it addresses a critical issue of fairness in AI systems, particularly in emotion recognition, which has applications in various fields such as mental health assessment, customer service, and human-computer interaction. By improving fairness in SER systems, this work contributes to the development of more equitable AI technologies that can better serve diverse populations.
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
Primary: Peking University
All Institutions: Peking University, The State Key Laboratory of Multimedia Information Processing, Hong Kong SAR, The Hong Kong University of Science and Technology
The main contribution of this paper is the introduction of SqueezeComposer, a novel framework for long-form music generation that utilizes temporal speed-up to enhance computational efficiency while preserving musical coherence. This work represents a significant advancement in the field of audio generation, addressing key challenges and opening avenues for future research in scalable music composition.
The methodology presented in SqueezeComposer is innovative, leveraging a temporal speed-up approach to address the challenges of long-form music generation. By generating music in an accelerated domain and restoring it to normal speed, the authors effectively reduce computational requirements while maintaining musical coherence. The hierarchical generation paradigm is well-structured, allowing for both abstract and detailed content generation. The use of diffusion models for both generation and refinement is a strong choice, aligning with current trends in audio synthesis. However, the paper could benefit from a more detailed explanation of the implementation specifics and the choice of hyperparameters.
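The speed-up/slow-down trick itself reduces to resampling along the time axis. A minimal 1-D numpy sketch of that round trip (the actual system operates on latent audio representations with diffusion-based refinement after restoration; the function name is hypothetical):

```python
import numpy as np

def change_speed(signal, factor):
    """Resample a 1-D signal by linear interpolation.

    factor > 1 speeds up (shorter output); factor < 1 slows down.
    Generation proceeds on the short, sped-up sequence, and the result
    is stretched back to the original length afterwards.
    """
    n_out = max(1, int(round(len(signal) / factor)))
    src = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(src, np.arange(len(signal)), signal)

x = np.sin(np.linspace(0, 2 * np.pi, 800))   # stand-in for a long waveform
fast = change_speed(x, 4.0)                  # 4x shorter sequence to model
restored = change_speed(fast, 0.25)          # stretched back to full length
```

The 4x-shorter sequence is what makes long-form generation tractable; the restoration stage then only has to recover fine detail, not long-range structure.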
The experiments are comprehensive, utilizing a variety of datasets and evaluation metrics, including Fréchet Audio Distance (FAD) and AudioBox-Aesthetics metrics. The results demonstrate that SqueezeComposer outperforms existing methods in terms of generation efficiency and quality, particularly in long-form music generation tasks. The comparison against established baselines is robust, showcasing the framework's effectiveness across different music generation scenarios. However, further qualitative assessments through user studies could enhance the evaluation of generated audio quality.
The paper provides a clear algorithmic description of the SqueezeComposer framework, but it lacks detailed implementation specifics, such as the exact architectures used for the diffusion models and the training process. Including code or a more thorough description of the experimental setup would improve reproducibility.
One limitation is the potential degradation in audio quality when using accelerated audio representations, which could affect the fidelity of the generated music. Additionally, while the framework shows promise for long-form music generation, the scalability to even longer compositions or more complex musical structures is not fully explored. The reliance on existing vocoders without retraining may also limit the potential for achieving the highest audio quality.
SqueezeComposer has the potential to significantly impact the field of music generation by enabling efficient production of long-form compositions, which could be beneficial for various applications in music production, film scoring, and interactive media. The approach could also inspire further research into hierarchical generation techniques and the use of accelerated representations in other domains of generative modeling.
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic clues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
Primary: The University of Osaka
All Institutions: The University of Osaka, The University of Tokyo
The paper presents AcoustEmo, a time-sensitive MLLM that significantly enhances open-vocabulary emotion reasoning by capturing local acoustic dynamics through a novel Utterance-Aware Acoustic Q-Former. This work is a meaningful contribution to the field of multimodal emotion recognition, addressing critical gaps in existing methodologies and demonstrating substantial technical advancements.
The proposed methodology introduces a novel architecture, AcoustEmo, which leverages an Utterance-Aware Acoustic Q-Former to address the limitations of traditional global audio encoders in capturing fine-grained acoustic details. The use of a timestamp-synchronized sliding window for dynamic extraction of segment-level audio tokens is innovative, allowing the model to maintain semantic coherence between audio and text modalities. This approach is well-justified and effectively targets the nuances of emotion conveyed through micro-prosody and intonation shifts, which are critical for accurate emotion recognition.
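The windowing step that feeds the Q-Former can be pictured as slicing a frame sequence into overlapping, timestamped segments. The numpy sketch below is illustrative only (the actual extraction is timestamp-synchronized against the transcript and feeds a learned Q-Former; names and window sizes are hypothetical):

```python
import numpy as np

def sliding_windows(frames, win, hop):
    """Slice a (T, D) frame sequence into overlapping windows.

    Each window would feed the segment-level Q-Former; keeping the start
    index of every window is what lets audio tokens stay aligned with
    transcript timestamps.
    """
    starts = list(range(0, max(1, len(frames) - win + 1), hop))
    return [(s, frames[s:s + win]) for s in starts]

feats = np.arange(20).reshape(10, 2)        # 10 frames, 2-dim features
segs = sliding_windows(feats, win=4, hop=2)  # 4 half-overlapping windows
```

Segment-level tokens like these, rather than one pooled global vector, are what allow the model to trace how prosody evolves within an utterance.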
The experiments conducted on the Explainable Multimodal Emotion Recognition (EMER) task demonstrate the effectiveness of the proposed model. The paper provides a comprehensive evaluation against multiple baseline models, showcasing significant improvements in performance metrics. The ablation studies further validate the necessity of the proposed components, reinforcing the claims made regarding the importance of local acoustic dynamics and timestamp synchronization.
The paper includes sufficient implementation details, including the architecture, optimization strategies, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to directly replicate the results.
While the model shows promising results, it occasionally misclassifies ambiguous emotional states, particularly in sarcastic utterances. Additionally, the performance can degrade in low-SNR scenarios due to background noise interference. These limitations highlight areas for future improvement, particularly in enhancing robustness against challenging acoustic conditions.
The advancements presented in AcoustEmo have significant implications for applications in empathetic conversational agents, mental health monitoring, and human-computer interaction. By improving the accuracy of emotion recognition in multimodal contexts, the model can contribute to more socially aware AI systems, enhancing user experiences in various interactive settings.
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
Primary: NAVER Cloud
All Institutions: NAVER Cloud
The main contribution of this paper is the introduction of the SNAP framework, which effectively mitigates speaker entanglement in deepfake detection by employing orthogonal projection techniques to isolate synthesis artifacts. This innovative approach not only achieves state-of-the-art performance but also demonstrates robust generalization capabilities across unseen speakers and TTS models, marking a significant advancement in the field of audio deepfake detection.
The proposed SNAP framework introduces a novel approach to disentangle speaker identity from synthetic speech detection by employing orthogonal projection techniques. This mathematical decomposition of the feature space into speaker-dependent and artifact subspaces is innovative and effectively addresses the identified issue of speaker entanglement. The use of a simple logistic regression classifier on the refined features demonstrates a practical application of the method, emphasizing efficiency without compromising performance.
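The orthogonal projection at the core of SNAP can be sketched in a rank-1 toy case. This assumes 3-D features and a single estimated speaker direction for clarity; the actual method estimates a multi-dimensional speaker subspace from data.

```python
import math

def project_out(x, u):
    """Remove the component of x along unit vector u (rank-1 speaker nulling).

    The full method estimates a speaker subspace U and computes
    x - U (U^T x); this sketch uses a single direction for clarity.
    """
    dot = sum(xi * ui for xi, ui in zip(x, u))
    return [xi - dot * ui for xi, ui in zip(x, u)]

# Hypothetical embedding and estimated speaker direction.
u = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]   # unit-norm speaker axis
x = [3.0, 1.0, 2.0]                             # encoder feature

residual = project_out(x, u)
leakage = sum(ri * ui for ri, ui in zip(residual, u))
print(abs(leakage) < 1e-9)  # True -- residual carries no speaker component
```

By construction the residual is orthogonal to the speaker axis, so a downstream classifier (here, SNAP's logistic regression) cannot exploit speaker-specific correlations along that direction.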
The experiments are well-structured, utilizing established datasets such as ASVspoof 2019 and 2021, and the In-the-Wild benchmark. The results show a clear improvement in detection performance, with significant reductions in equal error rates (EER) across various conditions, including unseen speakers and TTS models. The quantitative analysis of speaker entanglement through silhouette scores adds depth to the evaluation, reinforcing the effectiveness of the SNAP method.
The paper provides a clear description of the methodology, including feature extraction, subspace projection, and classification processes. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should consider sharing implementation details to facilitate independent validation of results.
While the SNAP framework shows promising results, it primarily focuses on speaker nulling, which may not address other potential confounding factors in deepfake detection. Additionally, the reliance on logistic regression may limit the exploration of more complex models that could further enhance performance. The generalization to unseen TTS models is commendable, but the robustness across all possible variations in synthetic speech generation remains to be fully evaluated.
The implications of this research extend beyond deepfake detection, as the speaker-nulling framework could be applied to other areas of audio processing, such as emotion recognition and speaker-independent speech recognition. The ability to isolate artifacts from speaker identity can enhance the reliability of various speech-related applications, contributing to the development of more secure and trustworthy audio technologies.
Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rates. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of a pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Experiments show that OmniCodec outperforms the Mimi codec at the same bitrate, delivering superior reconstruction quality while providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced. Our demo page is available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The main contribution of this paper is the introduction of OmniCodec, a universal neural audio codec that effectively combines low frame rate modeling with semantic-acoustic decoupling, achieving superior reconstruction quality and semantic representation across diverse audio domains. This work significantly advances the state of audio codecs, particularly in their application to large language models and generative tasks, and sets a foundation for future research in audio representation learning.
The methodology presented in this paper is innovative, particularly with its hierarchical multi-codebook design and the semantic-acoustic decoupling approach. The use of a pre-trained understanding model's audio encoder to enhance semantic representation is a novel contribution that addresses the limitations of existing codecs. The self-guidance strategy to improve codebook utilization is also a noteworthy addition, demonstrating a thoughtful approach to enhancing reconstruction quality while maintaining low frame rates.
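Hierarchical multi-codebook designs of this kind are typically residual quantizers, where each stage encodes what the previous stage left over. A minimal scalar sketch follows (illustrative codebooks only, and without OmniCodec's semantic-acoustic split or self-guidance):

```python
def quantize(value, codebooks):
    """Residual quantization: each codebook refines the previous residual."""
    codes, residual = [], value
    for cb in codebooks:
        # pick the nearest codeword to the current residual
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]
    return codes, value - residual  # indices and the reconstruction

# Coarse-to-fine scalar codebooks (illustrative values).
codebooks = [
    [-1.0, 0.0, 1.0],      # coarse stage
    [-0.25, 0.0, 0.25],    # fine stage
]
codes, recon = quantize(0.7, codebooks)
print(codes, round(recon, 2))  # [2, 0] 0.75
```

The coarse stage alone would reconstruct 1.0; the second codebook corrects the -0.3 residual to -0.25, tightening the reconstruction, which is the hierarchy's purpose at a fixed low frame rate.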
The experiments are robust, utilizing a comprehensive dataset of approximately 160,000 hours of audio across various domains (speech, music, and general sound). The evaluation metrics are well-chosen, including both objective measures (PESQ, STOI, Mel distance) and subjective assessments (N-MOS, S-MOS). The results indicate that OmniCodec outperforms existing models, particularly in the music and general sound domains, which validates the effectiveness of the proposed architecture.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which facilitates reproducibility. The open-sourcing of the model and code further enhances the potential for other researchers to replicate and build upon this work.
One limitation noted is the performance disparity in the speech domain compared to other models, which may be attributed to the structure of the WavLM model used for semantic supervision. Additionally, the paper acknowledges challenges in achieving optimal semantic decoupling for speech, suggesting that future work will be needed to address these issues.
The proposed OmniCodec has significant implications for audio generation tasks across various domains, including speech synthesis and music generation. Its ability to provide semantically informative representations can enhance applications in multimedia content creation, real-time audio processing, and interactive systems. The open-source nature of the project encourages further exploration and innovation in the field.
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
Primary: Nanjing University
All Institutions: Nanjing University, WeNet Open Source Community
The main contribution of this paper is the introduction of the Borderless Long Speech Synthesis framework, which innovatively integrates multi-dimensional annotations and contextual understanding into TTS systems, significantly advancing the state-of-the-art in audio synthesis. The technical contributions and proposed methodologies offer substantial improvements over existing systems, although further experimental validation and reproducibility efforts are necessary to solidify its impact in the field.
The proposed methodology introduces a novel framework for long-form speech synthesis that emphasizes the importance of global context and paralinguistic cues. The "Labeling over filtering/cleaning" strategy is innovative, as it challenges conventional practices in data preparation by advocating for the inclusion of complex, noisy data that reflects real-world speech dynamics. The Global-Sentence-Token hierarchical annotation schema is a significant advancement, enabling a structured approach to capturing the nuances of speech synthesis. The integration of Chain-of-Thought reasoning and Dimension Dropout enhances the model's ability to follow complex instructions, which is a notable methodological improvement over existing TTS systems.
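To make the top-down schema concrete, a hypothetical Global-Sentence-Token annotation might look like the following. Every field name here is an assumption chosen for illustration; the paper defines the actual schema.

```python
# Hypothetical illustration of a top-down Global-Sentence-Token annotation.
# All field names are assumptions; the paper specifies the real schema.
annotation = {
    "global": {"scene": "podcast studio", "emotion_arc": "calm -> excited"},
    "sentences": [
        {
            "speaker": "A",
            "style": "enthusiastic",
            "text": "Welcome back to the show!",
            "tokens": [
                {"text": "Welcome", "emphasis": True},
                {"text": "back", "emphasis": False},
            ],
        },
    ],
}

def levels(ann):
    """Walk the hierarchy from scene semantics down to token detail."""
    return ["global"] + [
        f"sentence[{i}]->tokens[{len(s['tokens'])}]"
        for i, s in enumerate(ann["sentences"])
    ]

print(levels(annotation))  # ['global', 'sentence[0]->tokens[2]']
```

The point of such a layered structure is that each level doubles as a control surface: an LLM agent can write scene-level fields while the synthesis engine consumes token-level ones.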
The paper lacks quantitative evaluations of the proposed system's performance, particularly in terms of emotional arc coherence and multi-speaker interaction naturalness. While it discusses the challenges of evaluating borderless long audio synthesis, it does not provide concrete experimental results or comparisons with existing methods. The absence of benchmark results limits the ability to assess the system's effectiveness rigorously. Future work is needed to establish robust evaluation metrics that can capture the richness of the proposed framework.
The paper does not provide sufficient implementation details or access to code and datasets, which raises concerns about reproducibility. The lack of a demo or project URL further complicates the ability for other researchers to replicate the findings or build upon this work. Clearer documentation and shared resources would enhance reproducibility.
The system is currently optimized for content creation rather than real-time interactions, which limits its applicability in dynamic environments. Additionally, the training data is primarily speech-centric, and the system's emergent capabilities for sound effects and music are not fully developed. These limitations suggest that while the framework is promising, it requires further refinement and expansion to address broader applications.
The potential applications of this research extend beyond traditional TTS systems, offering possibilities for enhanced audio experiences in content creation, gaming, and virtual environments. The ability to synthesize speech with rich emotional and contextual cues could significantly improve user engagement and interaction quality in various multimedia applications. However, the challenges in real-time synthesis and the need for more diverse training data must be addressed to realize its full impact.
While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
Primary: Sogang University
All Institutions: Sogang University
The main contribution of this paper is the introduction of CAF-Score, a novel reference-free metric for audio captioning evaluation that effectively combines the coarse-grained semantic alignment of CLAP with the fine-grained comprehension and syntactic awareness of LALMs. This work represents a significant step forward in the evaluation of audio captioning systems, addressing key challenges in the field and providing a foundation for future research and development.
The methodology presented in this paper is innovative, combining the strengths of CLAP and LALMs to create a reference-free evaluation metric for audio captioning. The use of a sliding-window approach with max pooling to enhance alignment accuracy is particularly noteworthy, as is the adaptation of the FLEUR metric for audio evaluation. The hybrid design effectively addresses the limitations of both models, allowing for more nuanced assessments of audio-text alignment.
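The sliding-window max pooling and fixed-weight fusion can be sketched as follows. The CLAP window-embedding step is abstracted away, and the weight value is a placeholder, not the paper's tuned setting.

```python
def caf_score(window_sims, lalm_score, w=0.5):
    """Combine CLAP and LALM judgments into one score.

    window_sims: CLAP audio-text similarities for each sliding window
                 (how each window is embedded is abstracted away here)
    lalm_score:  LALM rating of the caption, already scaled to [0, 1]
    w:           fixed fusion weight (the paper notes a fixed value
                 may not be optimal for every audio-caption pair)
    """
    clap_score = max(window_sims)          # max pooling over windows
    return w * clap_score + (1 - w) * lalm_score

# A caption matching only one segment of the audio still gets credit
# from the single window that covers it.
print(round(caf_score([0.1, 0.8, 0.2], lalm_score=0.6), 3))  # 0.7
```

Max pooling is what lets a locally correct caption survive CLAP's otherwise global alignment, while the LALM term penalizes syntactic errors and hallucinated details that CLAP overlooks.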
The experiments conducted on the BRACE benchmark are extensive and well-structured, demonstrating the effectiveness of CAF-Score in comparison to both reference-based and existing reference-free metrics. The paper provides a thorough analysis of the performance across multiple models and configurations, showcasing the robustness of the proposed metric in various scenarios, including hallucination detection.
The implementation details are clearly outlined, and the authors provide a GitHub repository with code and results, enhancing the reproducibility of the study. However, the reliance on specific model configurations and the computational overhead of LALMs may pose challenges for some researchers attempting to replicate the results.
The paper acknowledges that the performance of CAF-Score is bounded by the capabilities of the underlying models, and instances of simultaneous misalignment between CLAP and LALMs can lead to failures in evaluation. Additionally, the fixed weighting parameter may not be optimal for all audio-caption pairs, suggesting a need for further exploration of adaptive strategies.
The proposed CAF-Score metric has significant implications for the field of audio captioning, providing a scalable and robust evaluation framework that does not rely on costly ground-truth annotations. This advancement could facilitate the development of more effective audio understanding and captioning systems, ultimately enhancing the accessibility and usability of audio content across various applications.
A multi-task learning framework is proposed for optimizing a single deep neural network (DNN) for joint noise reduction (NR) and hearing loss compensation (HLC). A distinct training objective is defined for each task, and the DNN predicts two time-frequency masks. During inference, the amounts of NR and HLC can be adjusted independently by exponentiating each mask before combining them. In contrast to recent approaches that rely on training an auditory-model emulator to define a differentiable training objective, we propose an auditory model that is inherently differentiable, thus allowing end-to-end optimization. The audiogram is provided as an input to the DNN, thereby enabling listener-specific personalization without the need for retraining. Results show that the proposed approach not only allows adjusting the amounts of NR and HLC individually, but also improves objective metrics compared to optimizing a single training objective. It also outperforms a cascade of two DNNs that were separately trained for NR and HLC, and shows competitive HLC performance compared to a traditional hearing-aid prescription. To the best of our knowledge, this is the first study that uses an auditory model to train a single DNN for both NR and HLC across a wide range of listener profiles.
Primary: Technical University of Denmark
All Institutions: Technical University of Denmark
This paper presents a novel multi-task learning framework for joint noise reduction and hearing loss compensation using a single deep neural network. The approach's innovative use of a differentiable auditory model and listener-specific personalization is a significant contribution to the field, with promising experimental results that could lead to practical applications in hearing aids and auditory processing technologies.
The paper introduces a multi-task learning framework that optimizes a single DNN for joint noise reduction (NR) and hearing loss compensation (HLC). The distinct training objectives for each task are well-defined, and the use of a differentiable auditory model for end-to-end optimization is a significant methodological advance. The incorporation of listener-specific audiograms as input for personalization without retraining is particularly innovative, showcasing a practical approach to tailoring solutions for individual users. The methodology is sound, but further details on the architecture and training process would enhance understanding.
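The adjustable combination described above amounts to exponentiating each predicted mask before applying it. A minimal per-bin sketch follows; the real system operates on complex STFT frames, and the example mask values are illustrative.

```python
def apply_masks(magnitudes, nr_mask, hlc_mask, alpha=1.0, beta=1.0):
    """Scale T-F magnitudes by exponentiated NR and HLC masks.

    alpha/beta in [0, 1] dial each effect from off (0) to full (1);
    raising a mask to the power 0 turns it into all-ones, disabling it.
    """
    return [
        m * (nr ** alpha) * (hlc ** beta)
        for m, nr, hlc in zip(magnitudes, nr_mask, hlc_mask)
    ]

mags = [1.0, 1.0, 1.0]
nr = [0.5, 0.9, 0.1]    # noise-reduction mask (attenuates noisy bins)
hlc = [2.0, 1.5, 1.0]   # compensation gain (amplifies per the audiogram)

full = apply_masks(mags, nr, hlc)               # both effects at full strength
no_nr = apply_masks(mags, nr, hlc, alpha=0.0)   # NR disabled
print(no_nr)  # [2.0, 1.5, 1.0] -- only the HLC gain remains
```

Because the exponents are applied at inference time, a listener can trade off NR against HLC continuously without any retraining.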
The experiments conducted demonstrate a clear comparison between the proposed method and existing approaches, including a cascade of two separately trained DNNs and traditional hearing-aid prescriptions. The results indicate improvements in objective metrics, which are crucial for validating the effectiveness of the proposed framework. However, the paper could benefit from more extensive subjective evaluations (e.g., user studies) to complement the objective metrics and provide a holistic view of performance.
The paper lacks detailed implementation specifics, such as hyperparameters, training data characteristics, and the exact architecture of the DNN. This omission makes it challenging to fully assess reproducibility. Including a supplementary material section with code, datasets, and configuration files would significantly enhance reproducibility.
One limitation is the reliance on the audiogram as an input, which may not be available for all users. Additionally, while the results are promising, the paper does not address potential scalability issues or the performance of the model in real-world scenarios with diverse acoustic environments. The generalizability of the findings across different populations and hearing profiles also warrants further investigation.
The proposed framework has significant implications for the field of audiology and assistive technologies, potentially improving the quality of life for individuals with hearing loss. By enabling personalized adjustments to noise reduction and hearing compensation, this research could lead to more effective hearing aids and auditory devices. The integration of machine learning in this domain represents a step forward in the intersection of health technology and artificial intelligence.
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
Primary: Zhejiang University
All Institutions: Zhejiang University, The State Key Lab of Brain-Machine Intelligence, Zhejiang University
FoleyDirector introduces a novel framework for fine-grained temporal control in video-to-audio generation, significantly advancing the state-of-the-art in this domain. The combination of innovative methodologies and comprehensive experimental validation positions this work as a meaningful contribution to the field of machine learning and audio synthesis.
The methodology presented in FoleyDirector is innovative, particularly with the introduction of Structured Temporal Scripts (STS) and the Script-Guided Temporal Fusion Module. These components allow for fine-grained temporal control in video-to-audio generation, addressing a significant gap in existing methods that struggle with complex audio generation scenarios. The integration of Bi-Frame Sound Synthesis further enhances the capability to manage both in-frame and out-of-frame audio, showcasing a thoughtful approach to improving controllability in audio synthesis. The methodology is well-structured and provides a clear framework for implementation.
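A Structured Temporal Script might be represented as follows. This is a hypothetical layout: the segment fields and the well-formedness check are assumptions for illustration, not the paper's format.

```python
# Hypothetical Structured Temporal Script: one caption per short segment.
# Field layout is an assumption for illustration only.
sts = [
    {"start": 0.0, "end": 2.0, "caption": "footsteps on gravel"},
    {"start": 2.0, "end": 3.5, "caption": "door creaks open"},
    {"start": 3.5, "end": 5.0, "caption": "dog barks off-screen"},
]

def is_well_formed(script):
    """Segments must be ordered and non-overlapping so the fusion module
    receives unambiguous temporal guidance."""
    return all(
        a["start"] < a["end"] and a["end"] <= b["start"]
        for a, b in zip(script, script[1:])
    ) and script[-1]["start"] < script[-1]["end"]

print(is_well_formed(sts))  # True
```

Note that the third caption describes an off-screen sound, which is exactly the kind of event the review identifies as hard to recover from visual cues alone.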
The experimental section demonstrates a robust evaluation of the proposed framework. The construction of the DirectorSound dataset and the introduction of evaluation benchmarks (VGGSoundDirector and DirectorBench) are commendable, as they provide necessary resources for training and evaluation. The experiments effectively illustrate the improvements in temporal controllability and audio fidelity, with results that substantiate the claims made in the paper. However, details on the evaluation metrics used and their significance could be elaborated further to enhance clarity.
While the paper outlines the methodology and experiments, it lacks explicit details regarding the implementation and availability of the code or datasets, which could hinder reproducibility. Providing a link to a project repository or supplementary materials would greatly enhance the paper's reproducibility and allow other researchers to build upon this work.
One limitation is the potential complexity in user interaction with the system, as fine-grained control may require a steep learning curve for users unfamiliar with audio synthesis. Additionally, the paper does not address the scalability of the framework in real-world applications or the computational resources required for training and inference.
The advancements made in FoleyDirector have significant implications for various applications, including film production, video game development, and virtual reality, where precise audio generation is critical. By empowering users to act as Foley directors, the framework can enhance the creative process in multimedia content creation, potentially leading to more immersive experiences.
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
MOSS-TTSD presents a significant advancement in spoken dialogue generation, effectively addressing key challenges in the field. The comprehensive evaluation framework and the model's capabilities for long-form synthesis and multi-party interactions mark a notable contribution to the audio processing landscape.
The methodology presented in MOSS-TTSD is robust and well-structured, addressing significant challenges in spoken dialogue generation. The use of a fully discrete speech generation paradigm, combined with a multi-head delay pattern for autoregressive prediction, is innovative. The model's ability to handle long-form synthesis and multi-party dialogue through explicit speaker tagging and zero-shot voice cloning is a notable advancement. The introduction of the TTSD-eval framework for objective evaluation is a significant contribution, as it addresses the limitations of existing metrics that rely on speaker diarization.
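A multi-head delay pattern can be sketched as a per-codebook shift in the MusicGen style. The exact offsets used by MOSS-TTSD are not given here, so this is an assumed configuration.

```python
PAD = -1  # placeholder token for positions not yet valid

def apply_delay(codes):
    """Shift codebook k right by k steps (MusicGen-style delay pattern),
    so at step t the model's head k predicts the token for frame t - k,
    letting all heads fire in parallel each autoregressive step.
    """
    n_books, n_frames = len(codes), len(codes[0])
    out = [[PAD] * (n_frames + n_books - 1) for _ in range(n_books)]
    for k, row in enumerate(codes):
        for t, tok in enumerate(row):
            out[k][t + k] = tok
    return out

codes = [
    [10, 11, 12],  # codebook 0 tokens for 3 frames
    [20, 21, 22],  # codebook 1
]
delayed = apply_delay(codes)
print(delayed)  # [[10, 11, 12, -1], [-1, 20, 21, 22]]
```

The delay keeps fine codebooks conditioned on the coarse token of the same frame while avoiding a nested inner loop over codebooks, which is what makes single-pass long-form synthesis tractable.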
The experiments conducted are comprehensive, utilizing both objective and subjective evaluation methods. The paper provides a clear comparison against strong open-source and proprietary baselines, demonstrating the superiority of MOSS-TTSD in terms of speaker consistency and intelligibility. The use of diverse test sets and the detailed description of the evaluation metrics enhance the credibility of the results.
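TTSD-eval's speaker attribution accuracy, computed from forced-alignment output rather than a diarization system, can be sketched as below. This is a simplification that assumes a one-to-one word alignment between the synthesized audio and the script.

```python
def attribution_accuracy(aligned_words, script_words):
    """Fraction of words attributed to the scripted speaker.

    aligned_words: (word, speaker) pairs recovered from the synthesized
                   audio via forced alignment plus per-segment speaker ID
    script_words:  (word, speaker) pairs from the dialogue script
    Assumes a one-to-one word alignment for simplicity.
    """
    correct = sum(
        a_spk == s_spk
        for (_, a_spk), (_, s_spk) in zip(aligned_words, script_words)
    )
    return correct / len(script_words)

script = [("hi", "A"), ("there", "A"), ("hello", "B"), ("friend", "B")]
aligned = [("hi", "A"), ("there", "A"), ("hello", "A"), ("friend", "B")]
print(attribution_accuracy(aligned, script))  # 0.75
```

Scoring against the script's word-level speaker tags sidesteps diarization errors, which would otherwise be conflated with the synthesis model's own turn-taking mistakes.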
The paper lacks specific URLs for the code and models, which hinders reproducibility. While the methodology is described in detail, the absence of a public repository makes it difficult for other researchers to replicate the results. Providing access to the code and trained models would significantly improve the reproducibility of the findings.
One limitation is the reliance on high-quality training data, which may not be readily available for all languages and scenarios. Additionally, while the model supports multiple languages, the performance across less common languages is not thoroughly evaluated. The potential for biases in the voice cloning process, particularly with limited reference audio, is another area that could be explored further.
The implications of MOSS-TTSD are substantial, particularly in applications such as podcasts, dynamic commentary, and entertainment content. The ability to generate coherent and natural multi-party dialogues opens new avenues for automated content creation and enhances user interaction in various multimedia applications. The model's multilingual capabilities also contribute to its broader applicability in global contexts.
Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Gesture2Speech, a multimodal TTS framework that leverages hand gestures to enhance prosody in synthesized speech, showcasing a novel approach to integrating visual cues in speech synthesis. This work represents a significant step forward in the field of expressive speech synthesis, combining advanced machine learning techniques with insights from human communication to create more natural and engaging speech outputs.
The proposed Gesture2Speech framework introduces a novel multimodal TTS architecture that integrates hand gestures as dynamic control signals for prosody modulation in synthesized speech. The use of a Mixture-of-Experts (MoE) architecture to dynamically fuse linguistic and gesture features is innovative, allowing for flexible and context-aware speech synthesis. The introduction of a gesture-speech alignment loss to ensure temporal synchrony between gestures and prosodic contours is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the training process and the specific configurations of the MoE modules.
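Since the paper does not detail the MoE configuration, the fusion step can be sketched generically: each expert maps the concatenated linguistic and gesture features to a style vector, and a learned gate mixes the expert outputs with softmax weights. Everything below (the gate parameterization, expert interfaces, dimensions) is an assumption for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_fuse(linguistic, gesture, experts, gate_weights):
    """Minimal MoE fusion: experts consume the concatenated modality
    features; a linear gate scores each expert and the outputs are
    combined as a softmax-weighted sum."""
    x = linguistic + gesture  # concatenate modality feature vectors
    scores = [sum(w * xi for w, xi in zip(wrow, x)) for wrow in gate_weights]
    gate = softmax(scores)
    outs = [expert(x) for expert in experts]
    dim = len(outs[0])
    return [sum(g * o[d] for g, o in zip(gate, outs)) for d in range(dim)]
```

The gate is what makes the fusion context-aware: which expert dominates depends on the current linguistic and gesture input, rather than on a fixed mixing ratio.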
The experiments conducted on the PATS dataset demonstrate the effectiveness of the Gesture2Speech framework in improving speech naturalness and gesture-speech synchrony compared to state-of-the-art baselines. The use of both objective metrics (e.g., WER, CER, UTMOS) and subjective evaluations (Mean Opinion Scores) provides a comprehensive assessment of the model's performance. The results indicate that the proposed multimodal approach significantly enhances prosodic expressiveness and alignment, although further exploration of different datasets and real-world applications could strengthen the findings.
The paper provides a clear description of the experimental setup, including the dataset, model configurations, and evaluation metrics, which aids reproducibility. However, the lack of a publicly available code repository limits the ability for others to replicate the results directly. Including implementation details such as hyperparameters and training procedures would further enhance reproducibility.
One notable limitation is the reliance on the PATS dataset, which may not encompass a diverse range of cultural and emotional expressions. Additionally, the framework's performance in real-world scenarios, where full-body visibility or high-resolution hand tracking may not be feasible, remains uncertain. The paper also does not address potential computational overhead associated with the MoE architecture, which could impact deployment in resource-constrained environments.
The Gesture2Speech framework has significant implications for applications in areas such as virtual assistants, dubbing, and interactive storytelling, where expressive speech synthesis is crucial. By incorporating hand gestures into TTS systems, the research paves the way for more natural and engaging human-computer interactions. Furthermore, the findings could inspire future research into multimodal communication and the integration of additional non-verbal cues.
While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper presents ALICE, a novel framework for evaluating the in-context learning ability of large audio-language models, revealing critical insights into their limitations in cross-modal integration and task inference. The methodology is robust, and the findings contribute meaningfully to the understanding of LALMs, although further exploration in more diverse settings is warranted.
The paper introduces a novel three-stage evaluation framework (ALICE) that systematically reduces textual guidance to assess the in-context learning (ICL) ability of large audio-language models (LALMs) under audio conditioning. The methodology is well-structured, allowing for controlled experiments that isolate the effects of textual cues on task performance and format compliance. The use of diverse audio understanding tasks and the careful selection of models enhances the robustness of the findings.
The experiments are comprehensive, evaluating six LALMs across four audio understanding tasks with two output constraint categories. The results reveal a consistent asymmetry where in-context demonstrations improve format compliance but do not enhance core task performance, providing valuable insights into the limitations of current LALMs. The evaluation metrics are appropriate, and the analysis of results is thorough, although the paper could benefit from more detailed statistical analysis to support the claims.
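The reported asymmetry hinges on scoring format compliance and task correctness separately. A hedged sketch of such a scorer, assuming outputs must match a regex with a named `answer` group (the function and pattern convention are hypothetical, not ALICE's actual evaluation code):

```python
import re

def evaluate(outputs, references, fmt_pattern):
    """Score model outputs for format compliance (output matches the
    required pattern) and task accuracy (extracted answer is correct)
    as two separate rates, so the two metrics can diverge."""
    fmt_ok = task_ok = 0
    for out, ref in zip(outputs, references):
        m = re.fullmatch(fmt_pattern, out.strip())
        if m:
            fmt_ok += 1
            if m.group("answer").lower() == ref.lower():
                task_ok += 1
    n = len(outputs)
    return {"format_compliance": fmt_ok / n, "task_accuracy": task_ok / n}
```

Keeping the two rates separate is what lets an evaluation show demonstrations improving compliance while leaving, or even degrading, core task accuracy.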
The paper provides a GitHub repository link for the inference code and related resources, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics regarding the models and datasets used, which may hinder full reproducibility.
The study is limited to format-constrained audio understanding tasks, which may not generalize to other domains or more complex tasks. Additionally, the reliance on surface-level pattern matching for format inference suggests that the models may not be fully leveraging the potential of cross-modal integration, indicating a gap in their capabilities.
The findings have significant implications for the development of LALMs and highlight the need for improved training paradigms that better integrate auditory information with task objectives. This research could inform future work in multimodal AI systems, particularly in enhancing instruction-following and task inference capabilities in audio-language models.
With advances in AI speech synthesis, it is easier than ever to generate realistic audio in a target voice: a few seconds of reference audio suffice, quite literally putting words in the target person's mouth. This imposes new forensics challenges on speech-based authentication systems, videoconferencing, and audio-visual broadcasting platforms, where we want to detect synthetic speech. At the same time, AI speech synthesis can enhance communication through features such as low-bandwidth transmission and audio enhancement, leading to a growing number of legitimate use cases for synthetic audio. Here the goal shifts: we want to verify whether a given piece of synthetic audio is driven by an authorized identity. We term this task audio avatar fingerprinting. As a step towards audio forensics in these emerging situations, we analyze and extend an off-the-shelf speaker verification model, developed outside the forensics context, for fake speech detection and audio avatar fingerprinting, the first experiments of their kind. Furthermore, we observe that no existing dataset supports verifying the authorized use of synthetic audio, a limitation we address by introducing a new speech forensics dataset for this novel task.
Primary: Fort George G. Meade MD
All Institutions: Fort George G. Meade MD
The main contribution of this paper is the introduction of a novel framework for verifying the authorized use of synthetic audio through audio avatar fingerprinting. This work addresses a critical need in the evolving landscape of AI-generated content and has the potential to significantly impact the fields of audio forensics and security.
The paper proposes a novel approach termed "audio avatar fingerprinting," which extends existing speaker verification models to detect synthetic audio. The methodology is well-structured, leveraging off-the-shelf models while introducing a new dataset specifically designed for the task. The authors provide a clear rationale for their approach, addressing a significant gap in the current literature regarding the verification of synthetic speech. However, the paper could benefit from a more detailed explanation of the model's architecture and the specific modifications made to the existing verification model.
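The verification decision at the core of such a pipeline can be sketched generically: average the enrollment embeddings into a voiceprint, then accept if cosine similarity to the test embedding clears a tuned threshold. This is the standard speaker-verification scoring recipe, not the paper's specific model or threshold; for avatar fingerprinting, the test embedding would come from the synthetic audio and the enrollment embeddings from the authorized user:

```python
import math

def verify(enroll_embs, test_emb, threshold=0.6):
    """Cosine-scoring verification: compare the enrollment centroid
    against the test embedding and threshold the similarity.
    The 0.6 threshold is an illustrative placeholder to be tuned
    on a development set."""
    dim = len(test_emb)
    centroid = [sum(e[d] for e in enroll_embs) / len(enroll_embs) for d in range(dim)]
    dot = sum(a * b for a, b in zip(centroid, test_emb))
    norm = math.sqrt(sum(a * a for a in centroid)) * math.sqrt(sum(b * b for b in test_emb))
    score = dot / norm if norm else 0.0
    return score, score >= threshold
```

Returning the raw score alongside the decision makes it easy to trade off false accepts against false rejects by sweeping the threshold, which is how such systems are typically calibrated.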
The authors introduce a new dataset for the task, which is a crucial contribution as it enables future research in this area. The experiments conducted demonstrate the effectiveness of the proposed method in distinguishing between authorized and unauthorized synthetic audio. The results are promising, showcasing the potential of the approach, although the paper lacks a comprehensive comparison with other state-of-the-art methods in the domain of audio forensics.
The paper does not provide sufficient details regarding the implementation of the proposed methods or the dataset creation process, which may hinder reproducibility. Including code repositories or detailed experimental setups would enhance the ability of other researchers to replicate the findings.
One notable limitation is the reliance on a single dataset, which may not capture the full diversity of synthetic audio scenarios. Additionally, the paper does not address potential adversarial attacks on the proposed method, which could be a significant concern in real-world applications.
The implications of this research are substantial, particularly in the context of audio forensics and security. As synthetic audio becomes more prevalent, the ability to authenticate voice recordings is crucial for preventing misuse. This work could pave the way for more secure communication systems and enhance trust in audio-based interactions.