Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end-to-end models that exhibit unacceptable response delays. Moreover, the absence of real-world benchmarks and holistic metrics hinders progress in the field. This paper presents a comprehensive framework to overcome these limitations. We first introduce SID-Bench, the first benchmark for semantic-aware interruption detection built entirely from real-world human dialogues. To provide a rigorous assessment of the responsiveness-robustness trade-off, we propose the Average Penalty Time (APT) metric, which assigns a temporal cost to both false alarms and late responses. Building on this framework, we design an LLM-based detection model optimized through a novel training paradigm to capture subtle semantic cues of intent. Experimental results show that our model significantly outperforms mainstream baselines, achieving a nearly threefold reduction in APT. By successfully resolving the long-standing tension between speed and stability, our work establishes a new state-of-the-art for intelligent interruption handling in SDS. To facilitate future research, SID-Bench and the associated code are available at: https://github.com/xkx-hub/SID-bench.
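The APT idea described above can be illustrated with a minimal sketch: each interruption event is scored either with a fixed temporal penalty (for a false alarm) or with its response latency (for a late response), and APT is the mean penalty. The function name, event schema, and the fixed false-alarm cost are illustrative assumptions, not the paper's exact specification.

```python
# Hypothetical APT-style metric: false alarms incur a fixed temporal cost,
# correct detections incur their response latency; APT is the average.
FALSE_ALARM_COST = 5.0  # seconds; assumed constant, not the paper's value

def average_penalty_time(events):
    """events: list of dicts with 'false_alarm' (bool) and 'latency' (seconds)."""
    penalties = []
    for e in events:
        if e["false_alarm"]:
            penalties.append(FALSE_ALARM_COST)
        else:
            penalties.append(e["latency"])
    return sum(penalties) / len(penalties)

events = [
    {"false_alarm": False, "latency": 0.4},
    {"false_alarm": True,  "latency": 0.0},
    {"false_alarm": False, "latency": 1.1},
]
print(round(average_penalty_time(events), 3))  # → 2.167
```

A single scalar like this captures the responsiveness-robustness trade-off the abstract describes: a trigger-happy detector accumulates false-alarm costs, while an overly cautious one accumulates latency.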
Primary: Qwen Team, Alibaba
All Institutions: Qwen Team, Alibaba, Independent Researcher
The main contribution of this paper is the introduction of SID-Bench and the APT metric, which together provide a comprehensive framework for evaluating and improving interruption detection in spoken dialogue systems. This work significantly enhances the understanding of user interruptions in conversational AI, offering a robust methodology and strong experimental validation that addresses key challenges in the field.
The paper introduces a novel framework for interruption detection in spoken dialogue systems, emphasizing the creation of SID-Bench, a benchmark based on real-world data, and the Average Penalty Time (APT) metric for evaluation. The methodology is robust, incorporating a two-stage training paradigm that effectively leverages large language models (LLMs) for semantic understanding, and a hybrid annotation approach that combines LLMs with forced alignment for precise interruption labeling. This innovative approach addresses the limitations of existing VAD-based systems and enhances the model's ability to discern genuine interruptions from backchannels.
The experimental results are comprehensive, demonstrating a significant performance improvement in the proposed model over existing baselines across various metrics, including APT, FIR, and IRL. The use of SID-Bench allows for a rigorous evaluation of the model's capabilities in real-world scenarios, and the results clearly illustrate the trade-off between responsiveness and robustness, validating the effectiveness of the proposed methods.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which would allow for reproducibility. The availability of SID-Bench and the associated code on GitHub further enhances the potential for other researchers to replicate the study and build upon the findings.
One limitation is the reliance on a specific set of conversational data, which may not encompass all possible interaction scenarios. Additionally, while the model achieves significant improvements in APT, further exploration into its performance across diverse languages and dialects could be beneficial. The paper also does not address the computational resources required for training the LLM-based model, which may limit accessibility for some researchers.
The proposed framework and benchmark have the potential to significantly advance the field of spoken dialogue systems by providing a more nuanced understanding of interruption handling. This could lead to more natural and efficient human-computer interactions, with applications in customer service, virtual assistants, and other conversational AI systems. The introduction of SID-Bench sets a precedent for future research in this area, encouraging the development of more sophisticated models that can better understand human intent.
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.
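The early-exit mechanism above can be sketched as a simple client-side depth selection: given per-layer compute costs and a local budget, a client runs the deepest prefix of the backbone (and its attached prediction head) that fits. The cost model, budget units, and function name are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical client-side early-exit selection: pick the deepest
# intermediate head the local compute budget can afford.
def select_exit(layer_costs, budget):
    total, depth = 0.0, 0
    for cost in layer_costs:
        if total + cost > budget:
            break                  # next layer would exceed the local budget
        total += cost
        depth += 1
    return depth                   # number of backbone layers this client runs

# A client with budget 3.5 on a 4-layer backbone stops after layer 2.
print(select_exit([1.0, 1.0, 2.0, 2.0], budget=3.5))  # → 2
```

Under this scheme, straggler clients simply exit earlier rather than stalling the round, which is the heterogeneity argument the abstract makes.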
Primary: University of Cambridge
All Institutions: University of Cambridge, Electronic Information School, Flower Labs, University of Auckland, University of Melbourne, Wuhan University
This paper presents a novel adaptive federated fine-tuning framework that effectively addresses the challenges of heterogeneous environments in self-supervised speech representation learning. The technical contributions, particularly in the areas of early exits and layer-wise aggregation, represent a meaningful advancement in the field of federated learning for audio applications.
The proposed adaptive federated fine-tuning framework introduces innovative mechanisms such as early exits and layer-wise partial aggregation, which effectively address the challenges posed by heterogeneity in federated learning environments. The methodology is well-structured, leveraging an elastic multi-branch architecture that allows clients to dynamically select their training depth based on local resources and task complexity. This approach not only enhances computational efficiency but also maintains performance across diverse speech tasks. The integration of lightweight prediction heads and depth-aware aggregation strategies is a significant advancement in federated learning for speech applications.
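The depth-aware partial aggregation described above can be sketched as a per-layer FedAvg in which each layer is averaged only over the clients that actually trained it. Unweighted averaging, the scalar "weights", and the data layout are simplifying assumptions for illustration, not the paper's exact rule.

```python
# Minimal sketch of layer-wise partial aggregation under heterogeneous
# client depths: a layer's update is averaged only over clients whose
# chosen exit depth covers that layer.
def partial_aggregate(client_updates):
    """client_updates: list of (depth, {layer_idx: weight_value}) pairs."""
    sums, counts = {}, {}
    for depth, layers in client_updates:
        for idx, value in layers.items():
            if idx < depth:        # a client contributes only layers it trained
                sums[idx] = sums.get(idx, 0.0) + value
                counts[idx] = counts.get(idx, 0) + 1
    return {idx: sums[idx] / counts[idx] for idx in sums}

# Two shallow clients (depth 2) and one deep client (depth 3).
updates = [
    (2, {0: 1.0, 1: 2.0}),
    (2, {0: 3.0, 1: 4.0}),
    (3, {0: 5.0, 1: 6.0, 2: 7.0}),
]
print(partial_aggregate(updates))  # → {0: 3.0, 1: 4.0, 2: 7.0}
```

The key property is visible in the output: layer 2 is taken from the single deep client rather than diluted by clients that never updated it.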
The experiments are comprehensive, covering five diverse downstream tasks that span various aspects of speech understanding. The results demonstrate the effectiveness of the proposed framework in reducing computational overhead while achieving competitive performance compared to centralized training. The evaluation metrics used, including word error rates and classification error rates, are appropriate for the tasks at hand. However, the paper could benefit from additional comparisons with existing state-of-the-art methods to further contextualize the results.
The paper provides a detailed description of the experimental setup, including datasets, model architectures, and training configurations, which aids reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the experiments. Including a link to the implementation would significantly enhance reproducibility.
One limitation is the reliance on a specific backbone model (Wav2Vec 2.0), which may not generalize to all speech tasks or architectures. Additionally, while the framework addresses resource constraints, it does not fully explore the implications of data heterogeneity beyond the basic partitioning strategy employed. The paper could also discuss potential trade-offs between performance and computational efficiency in more detail.
The proposed framework has significant implications for deploying speech recognition systems in privacy-sensitive environments, such as mobile devices and personal assistants. By enabling efficient fine-tuning without compromising user data privacy, this work contributes to the growing field of privacy-preserving machine learning. The methodology could be adapted to other domains where federated learning is applicable, potentially influencing future research in decentralized learning systems.
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from variational auto-encoders (VAEs). We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve parameter-efficient, high-performance training for DiT-Flow that is robust to multiple distortions, using only 4.9% of the total parameters while achieving better performance on five unseen distortions.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Technion Israel Institute of Technology, University of Haifa
The main contribution of this paper is the development of DiT-Flow, a novel speech enhancement framework that effectively utilizes flow matching and latent representations to improve robustness against multiple distortions. This work represents a significant step forward in the field of audio processing, addressing common challenges faced in real-world applications and demonstrating the potential for future advancements in speech enhancement technologies.
The methodology of DiT-Flow is robust, leveraging flow matching and latent Diffusion Transformers to enhance speech under multiple distortions. The integration of LoRA with the Mixture-of-Experts framework is particularly innovative, allowing for parameter-efficient adaptation to varying acoustic conditions. The use of a synthetic dataset, StillSonicSet, designed to simulate realistic conditions, further strengthens the approach. However, the paper could benefit from clearer descriptions of hyperparameter choices and training procedures.
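The flow matching objective underlying models like DiT-Flow can be sketched in its standard conditional form: interpolate between a noise sample x0 and a clean latent x1 along a straight path, and regress the model's predicted velocity toward (x1 - x0). This scalar toy shows only the regression target construction, under standard conditional flow matching assumptions; real models operate on latent tensors.

```python
# Toy sketch of the conditional flow matching target: x_t lies on the
# straight path from noise x0 to data x1, and the regression target is
# the constant velocity of that path.
def flow_matching_pair(x0, x1, t):
    x_t = (1.0 - t) * x0 + t * x1   # linear interpolation at time t in [0, 1]
    target_velocity = x1 - x0       # constant velocity along the straight path
    return x_t, target_velocity

x_t, v = flow_matching_pair(x0=0.0, x1=2.0, t=0.25)
print(x_t, v)  # → 0.5 2.0
```

A training step would then minimize the squared error between the network's velocity prediction at (x_t, t) and this target.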
The experiments are comprehensive, validating DiT-Flow against state-of-the-art models across various conditions. The use of multiple evaluation metrics, including PESQ, ESTOI, and DNSMOS, provides a well-rounded assessment of performance. The results demonstrate significant improvements over baseline models, particularly in challenging scenarios, indicating the effectiveness of the proposed methods. However, the paper lacks detailed comparisons with a broader range of existing methods, which could provide more context for its contributions.
The paper includes sufficient detail regarding the model architecture and training process, but lacks a clear link to code or datasets, which hampers reproducibility. Providing access to the StillSonicSet dataset and the trained models would enhance reproducibility and facilitate further research.
One limitation is the reliance on synthetic data, which may not fully capture the complexities of real-world audio environments. Additionally, while the model shows robustness to multiple distortions, its performance in extreme or novel conditions remains to be tested. The computational efficiency of the model in real-time applications also needs further exploration.
The advancements in speech enhancement presented in this paper have significant implications for real-world applications, particularly in telecommunication, virtual meetings, and assistive technologies. The ability to enhance speech quality in diverse acoustic environments can improve communication clarity and accessibility for users in various settings.
Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses and a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly improves hotword retrieval and reduces both CER and B-WER against strong contextual ASR baselines.
Primary: BRVoice Team
All Institutions: BRVoice Team
The paper presents CLAR, a novel dual-encoder retrieval system that enhances contextual ASR by effectively localizing hotwords without timestamp supervision. The innovative methodology and significant experimental results position CLAR as a valuable contribution to the field of speech recognition and retrieval-augmented systems.
The paper introduces CLAR, a dual-encoder speech-text retriever that utilizes a Continuous Integrate-and-Fire (CIF) mechanism for monotonic token-level alignment without timestamps. The methodology is innovative, particularly in its approach to localized matching and the multi-granularity training objective that combines various contrastive losses. The use of CIF for hotword retrieval under weak supervision is a significant advancement in addressing the challenges of named entity recognition in ASR systems.
The experiments are well-structured, utilizing relevant datasets (AISHELL-1 and AISHELL-2) and metrics (CER, B-WER, recall, F1 score) to evaluate the performance of CLAR against strong baselines. The results demonstrate significant improvements in hotword retrieval and ASR accuracy, validating the effectiveness of the proposed method. However, the paper could benefit from additional comparative analyses with more recent state-of-the-art methods.
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which would allow for reproducibility. However, the lack of a publicly available code repository or demo limits accessibility for further validation by the research community.
The paper does not address potential limitations in terms of scalability to larger datasets or multilingual settings, which could affect the generalizability of the findings. Additionally, the reliance on weak supervision may introduce challenges in alignment accuracy, particularly in noisy environments.
The advancements presented in this paper have significant implications for improving ASR systems in real-world applications, particularly in domains requiring accurate recognition of low-frequency words and named entities. The modular nature of CLAR allows for integration with various Speech LLMs, potentially enhancing user interactions in conversational AI systems.
Turn-taking modeling is fundamental to spoken dialogue systems, yet its evaluation remains fragmented and often limited to binary boundary detection under narrow interaction settings. Such protocols hinder systematic comparison and obscure model weaknesses across conversational conditions. We present CoDeTT, a context-aware decision benchmark for turn-taking evaluation. CoDeTT formulates turn-taking as a structured decision problem and constructs a multi-scenario dataset with fine-grained decision categories and controlled context variations. Under a unified evaluation protocol, we assess representative existing models and observe substantial performance disparities across decision types and interaction scenarios. CoDeTT provides a standardized benchmark for systematic and context-aware evaluation of turn-taking systems. The benchmark dataset and evaluation toolkit are available at https://github.com/YingaoWang-casia/CoDeTT.github.io.
Primary: BRVoice Team
All Institutions: BRVoice Team
The main contribution of this paper is the introduction of CoDeTT, a context-aware decision benchmark for turn-taking evaluation that systematically addresses the limitations of existing evaluation protocols. This work is significant as it enhances the understanding of model performance in conversational systems and provides a foundation for future research in turn-taking dynamics.
The proposed methodology introduces CoDeTT, a structured decision benchmark for turn-taking evaluation that effectively captures the complexities of conversational interactions. By formulating turn-taking as a structured decision problem and creating a multi-scenario dataset with fine-grained decision categories, the authors provide a robust framework for evaluating existing models. The use of a hierarchical taxonomy and a Two-Stage Funnel Evaluation Protocol enhances the depth of analysis, allowing for both coarse and fine-grained assessments of model performance.
The experiments are comprehensive, utilizing a large bilingual dataset of over 300 hours of dialogue, which is well-annotated and balanced across various decision scenarios. The evaluation of existing models under the CoDeTT benchmark reveals significant performance disparities, highlighting the utility of the benchmark in exposing model weaknesses. The introduction of the Semantic Misalignment Rate (SMR) as a diagnostic metric is particularly noteworthy, as it provides insights into the underlying reasoning of models.
The paper provides a clear description of the dataset construction process and the evaluation protocol, which enhances reproducibility. The availability of the dataset and evaluation toolkit on GitHub further supports this aspect. However, the paper could benefit from more detailed implementation guidelines for replicating the experiments.
One limitation is the reliance on synthetic data generation for part of the dataset, which may introduce biases or artifacts that could affect model performance. Additionally, while the benchmark exposes decision-specific performance variations, it may not fully account for all contextual nuances present in real-world conversations.
The CoDeTT benchmark has the potential to significantly influence the development of more sophisticated and context-aware conversational agents, improving user experience in spoken dialogue systems. By providing a standardized evaluation framework, it encourages further research into turn-taking modeling and may lead to advancements in human-computer interaction.
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives (including speech, music, and acoustic properties), which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.
Primary: Xiaomi Research
All Institutions: Xiaomi Research
The main contribution of this paper is the introduction of ACAVCaps, a novel large-scale audio captioning dataset that significantly enhances the granularity and diversity of audio understanding, thereby advancing the development of robust audio-language models. The methodology and experimental validation presented in this work position it as a valuable resource for future research in the field of audio processing and multimodal learning.
The methodology for constructing the ACAVCaps dataset is innovative, utilizing a multi-expert pipeline that integrates various analytical perspectives (speech, music, acoustic properties) and synthesizes detailed descriptions using a large language model (LLM). This approach addresses the limitations of existing datasets by ensuring both scale and descriptive granularity, which are crucial for training versatile audio models. The use of Chain-of-Thought (CoT) prompting for LLMs to generate diverse and semantically rich captions is particularly noteworthy, as it enhances the quality of the generated annotations.
The experimental evaluation is robust, demonstrating clear superiority of models trained on ACAVCaps across various downstream tasks compared to other datasets. The use of comprehensive benchmarks like MECAT-Caption and the detailed analysis of generalization performance across multiple audio domains (speech, sound events, music) provide strong evidence of the dataset's effectiveness. The results are quantitatively supported by metrics that emphasize both descriptive specificity and semantic similarity, reinforcing the dataset's intended impact.
The paper provides sufficient implementation details regarding the training process, model architecture, and evaluation metrics. However, the reproducibility may be limited by the lack of access to the specific expert models used in the multi-expert pipeline, which are crucial for generating the dataset. The dataset itself is available, which aids in reproducibility, but the exact configurations and parameters for the LLM and expert models could be better documented.
One limitation is the potential bias introduced by the expert models used for audio analysis, which may not capture all nuances of audio content. Additionally, while the dataset is large and diverse, it may still miss certain rare or unique audio events that could be important for comprehensive audio understanding. The reliance on automated processes for generating captions might also lead to inconsistencies in quality across different audio samples.
The introduction of ACAVCaps has significant implications for the field of audio understanding and multimodal AI. By providing a rich, large-scale dataset, it enables the development of more capable audio-language models that can generalize better across various tasks. This can lead to advancements in applications such as automatic audio transcription, sound event detection, and even creative audio generation, ultimately enhancing the capabilities of AI systems in understanding and interacting with the auditory world.
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.
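The aggregation-then-correlation step in the abstract above can be sketched as follows: average an objective metric over synthesis iterations per system, then compare the resulting system ranking to human scores via Spearman rank correlation (SRCC). The ranking helper below is a plain tie-free implementation; the systems, scores, and human ratings are made-up numbers for illustration, not results from the paper.

```python
# Sketch of I2D-style evaluation: aggregate a metric across iterations
# per system, then rank-correlate against human judgments (SRCC).
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def srcc(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))  # tie-free Spearman formula

# Hypothetical per-system scores over three iterations: all systems start
# near the same score, but better systems degrade more slowly.
iter_scores = {"A": [4.0, 3.8, 3.7], "B": [4.0, 3.2, 2.5], "C": [4.0, 2.0, 1.0]}
aggregated = {k: sum(v) / len(v) for k, v in iter_scores.items()}
human = {"A": 4.5, "B": 3.9, "C": 2.8}
systems = list(iter_scores)
print(srcc([aggregated[s] for s in systems], [human[s] for s in systems]))  # → 1.0
```

Note how the first-iteration scores alone (all 4.0) would be useless for ranking: it is the differential degradation across iterations that separates the systems, which is the mechanism I2D exploits.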
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Nanjing University
The main contribution of this paper is the introduction of the I2D framework, which enhances the reliability and discriminability of zero-shot TTS evaluations through an innovative iterative synthesis approach. This work addresses critical challenges in TTS evaluation, providing a robust methodology that can significantly impact the field by enabling more accurate assessments of model performance.
The proposed I2D framework introduces an innovative approach to TTS evaluation by leveraging iterative synthesis to amplify performance differences among models. This methodology addresses critical issues of score saturation in traditional evaluation metrics, providing a more reliable means of assessing TTS systems. The recursive use of synthesized outputs as references is a novel strategy that effectively reveals robustness differences among models, which is a significant advancement in the field.
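The differential-degradation principle behind I2D can be illustrated with a toy simulation (the decay model and all names below are assumptions for illustration, not the paper's actual synthesis pipeline):

```python
def iterate_scores(degrade_rate, s0=4.0, n_iter=5):
    # Toy stand-in for recursive synthesis: each pass that reuses the model's
    # own output as the reference multiplies quality by (1 - degrade_rate).
    scores, s = [], s0
    for _ in range(n_iter):
        s *= 1.0 - degrade_rate
        scores.append(s)
    return scores

def i2d_aggregate(scores):
    # Aggregating an objective metric across iterations, as I2D does,
    # widens the gap between a robust and a fragile model.
    return sum(scores) / len(scores)

robust = iterate_scores(0.05)    # degrades slowly under resynthesis
fragile = iterate_scores(0.10)   # degrades quickly
gap_single = robust[0] - fragile[0]
gap_i2d = i2d_aggregate(robust) - i2d_aggregate(fragile)
```

In this toy setting `gap_i2d` exceeds `gap_single`, mirroring how iterative synthesis amplifies score differences that a single evaluation pass would leave saturated.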
The experiments conducted on 11 TTS models across multiple datasets are comprehensive and well-structured. The paper demonstrates a clear correlation between the proposed evaluation method and human judgments, significantly improving the reliability of automated TTS assessments. The use of both objective and subjective metrics strengthens the findings, although the paper could benefit from more detailed statistical analyses to further validate the results.
The paper provides sufficient details on the datasets, evaluation metrics, and experimental setup, which supports reproducibility. However, the lack of a publicly accessible code repository limits the ability for others to directly replicate the results. Including a project URL would enhance reproducibility.
The paper acknowledges higher computational costs associated with the I2D framework and its potential bias towards model stability over diversity. Additionally, the evaluation's reliance on reference audio quality may introduce conflicts in assessing naturalness versus speaker similarity, particularly in zero-shot settings.
The I2D framework has significant implications for the TTS community, offering a scalable and practical solution for evaluating model performance. By improving the discriminability of evaluation metrics, it can facilitate advancements in TTS technology, leading to better user experiences in applications such as virtual assistants, audiobooks, and more.
We propose Uni-ArrayDPS, a novel diffusion-based refinement framework for unified multi-channel speech enhancement and separation. Existing methods for multi-channel speech enhancement/separation are mostly discriminative and are highly effective at producing high-SNR outputs. However, they can still generate unnatural speech with non-linear distortions caused by the neural network and regression-based objectives. To address this issue, we propose Uni-ArrayDPS, which refines the outputs of any strong discriminative model using a speech diffusion prior. Uni-ArrayDPS is generative, array-agnostic, and training-free, and supports both enhancement and separation. Given a discriminative model's enhanced/separated speech, we use it, together with the noisy mixtures, to estimate the noise spatial covariance matrix (SCM). We then use this SCM to compute the likelihood required for diffusion posterior sampling of the clean speech source(s). Uni-ArrayDPS requires only a pre-trained clean-speech diffusion model as a prior and does not require additional training or fine-tuning, allowing it to generalize directly across tasks (enhancement/separation), microphone array geometries, and discriminative model backbones. Extensive experiments show that Uni-ArrayDPS consistently improves a wide range of discriminative models for both enhancement and separation tasks. We also report strong results on a real-world dataset. Audio demos are provided at https://xzwy.github.io/Uni-ArrayDPS/.
Primary: University of Illinois at Urbana-Champaign
All Institutions: University of Illinois at Urbana-Champaign, Reality Labs Research at Meta
The main contribution of this paper is the introduction of Uni-ArrayDPS, a novel diffusion-based refinement framework that significantly enhances multi-channel speech enhancement and separation tasks by integrating generative and discriminative models. This approach addresses the limitations of existing methods, providing a robust solution that improves perceptual quality and intelligibility in various acoustic environments.
The proposed Uni-ArrayDPS framework utilizes a diffusion-based refinement approach that is both generative and array-agnostic, addressing the limitations of existing discriminative models in multi-channel speech enhancement and separation. The methodology is well-structured, leveraging a pre-trained clean-speech diffusion model and estimating the noise spatial covariance matrix (SCM) from the outputs of discriminative models. This allows for a seamless integration of generative and discriminative approaches, enhancing the quality of the output without requiring additional training. The paper also provides a thorough explanation of the diffusion process, likelihood computation, and the refinement pipeline, demonstrating a clear understanding of the underlying principles.
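The central refinement signal, a noise SCM estimated from the discriminative front end's output, can be sketched in NumPy as follows (the shapes, diagonal loading, and the simplified Gaussian likelihood are illustrative assumptions; the paper's exact likelihood follows its diffusion posterior sampling derivation):

```python
import numpy as np

def estimate_noise_scm(Y, S_hat, eps=1e-6):
    """Y, S_hat: (F, T, C) complex STFTs of the noisy mixture and the
    discriminatively enhanced speech image. Returns a per-frequency
    noise SCM of shape (F, C, C)."""
    N = Y - S_hat                                     # residual attributed to noise
    R = np.einsum('ftc,ftd->fcd', N, N.conj()) / N.shape[1]
    return R + eps * np.eye(R.shape[-1])[None]        # diagonal loading for stability

def gaussian_loglik(Y, S_hat, R):
    """Spatial Gaussian log-likelihood of the mixture given a candidate clean
    source image, up to additive constants -- the kind of term a diffusion
    posterior sampler differentiates through."""
    N = Y - S_hat
    quad = np.einsum('ftc,fcd,ftd->', N.conj(), np.linalg.inv(R), N).real
    return -quad
```

Because the SCM comes entirely from the discriminative model's output and the mixture, no retraining is needed to plug in a different backbone.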
The experimental results are extensive, showcasing the effectiveness of Uni-ArrayDPS across various discriminative models and real-world datasets. The paper reports significant improvements in perceptual quality, intelligibility, and automatic speech recognition (ASR) metrics. The use of multiple evaluation metrics, including STOI, PESQ, and WER, provides a comprehensive assessment of the framework's performance. The ablation studies further validate the impact of different hyperparameters on the refinement process, offering insights into the balance between generative and discriminative outputs.
The paper includes detailed descriptions of the experimental setup, including hyperparameter configurations, dataset generation, and model training protocols. However, the lack of a publicly available code repository limits the reproducibility of the results. Future work should consider making the code and trained models accessible to facilitate further research and validation.
While the framework shows promising results, it is primarily evaluated on simulated datasets, which may not fully capture the complexities of real-world scenarios. Additionally, the reliance on a pre-trained diffusion model may limit the generalizability of the approach to different speech domains or environments. The paper acknowledges these limitations but does not provide extensive discussion on how they might be addressed in future work.
The proposed method has significant implications for various applications in speech processing, including telecommunications, hearing aids, and voice recognition systems. By improving the quality of speech enhancement and separation in noisy environments, the framework could enhance communication clarity and accessibility for users in challenging acoustic settings. The integration of generative models into traditional discriminative frameworks also opens new avenues for research in audio processing.
Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of the CURriculum Ranking loss, which effectively addresses the challenges of large-scale speaker verification by dynamically adjusting the learning process based on sample difficulty. This innovative methodology, coupled with strong experimental results, positions the work as a significant advancement in the field of audio processing and speaker verification.
The proposed CURriculum Ranking (Curry) loss introduces an innovative approach to handling sample difficulty in speaker verification tasks. By utilizing Sub-center ArcFace for estimating sample difficulty and dynamically adjusting the learning process based on sample quality, the methodology stands out for its adaptability and lack of reliance on auxiliary annotations. This approach addresses a significant gap in existing loss functions that treat all samples uniformly, thereby enhancing the robustness of the model against noisy gradients from mislabeled or degraded samples.
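A minimal sketch of online tiering from running batch statistics might look as follows (the momentum value and z-score thresholds are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

class TierAssigner:
    """Assign easy/medium/hard tiers from sub-center confidence scores using
    running batch statistics, with no auxiliary annotations."""
    def __init__(self, momentum=0.9):
        self.momentum, self.mean, self.std = momentum, 0.0, 1.0

    def update(self, conf):
        # exponential moving average of batch confidence statistics
        m, s = conf.mean(), conf.std() + 1e-8
        self.mean = self.momentum * self.mean + (1 - self.momentum) * m
        self.std = self.momentum * self.std + (1 - self.momentum) * s

    def assign(self, conf):
        z = (conf - self.mean) / self.std
        tiers = np.full(conf.shape, 'medium', dtype=object)
        tiers[z > 0.5] = 'easy'    # high confidence: clean, well-fit samples
        tiers[z < -0.5] = 'hard'   # low confidence: possibly noisy or mislabeled
        return tiers
```

The tier labels would then gate learnable weights that shift emphasis across training stages, from stable identities toward boundary sharpening.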
The experiments conducted on large-scale datasets such as VoxCeleb1-O and SITW are rigorous and demonstrate substantial improvements in Equal Error Rate (EER) over the baseline. The reported reductions of 86.8% and 60.0% in EER are compelling, showcasing the effectiveness of the proposed method. However, further details on the experimental setup, including the specific configurations and hyperparameters used, would enhance the transparency and replicability of the results.
The paper lacks sufficient implementation details, such as code availability or specific configurations, which could hinder reproducibility. While the methodology is well-articulated, providing access to the code and datasets used would significantly bolster the paper's impact and allow other researchers to validate the findings.
One limitation is the potential overfitting to the specific datasets used for evaluation. The performance improvements may not generalize to other datasets or real-world scenarios without further validation. Additionally, the reliance on running batch statistics may introduce variability depending on batch sizes and compositions, which could affect the stability of the training process.
The proposed method has significant implications for speaker verification systems, particularly in real-world applications where data quality can vary widely. By improving robustness against mislabeled and degraded samples, this research could enhance the reliability of speaker verification in security, forensics, and personal assistant technologies. The approach could also inspire further research into adaptive loss functions across various machine learning domains.
Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.
Primary: University
All Institutions: University
YingMusic-Singer presents a significant advancement in controllable singing voice synthesis, offering a novel approach to lyric manipulation while maintaining melody fidelity. The technical contributions, particularly in methodology and evaluation, position this work as a valuable asset in the field of audio and music technology.
The methodology presented in YingMusic-Singer is innovative, leveraging a fully diffusion-based model that synthesizes singing voices from minimal input without requiring manual alignment. The use of curriculum learning and Group Relative Policy Optimization (GRPO) is particularly noteworthy as it addresses the trade-off between melody adherence and lyric fidelity. The architecture integrates a Variational Autoencoder, a Melody Extractor, and an IPA Tokenizer, which collectively enhance the model's ability to generate high-quality outputs. The introduction of LyricEditBench as a benchmark for evaluating lyric modification is a significant contribution, providing a structured framework for future research.
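The Group Relative Policy Optimization step can be illustrated by its core advantage computation, which scores each sampled output against its own sampling group rather than a learned value critic (a generic GRPO sketch; the paper's reward design is not reproduced here):

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Standard GRPO advantage: z-score each candidate's reward against the
    mean and std of the group sampled for the same input."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In this setting the rewards would plausibly combine lyric-adherence and melody-consistency scores over several candidate renditions of the same modified lyrics.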
The experiments are thorough, comparing YingMusic-Singer against Vevo2 across multiple tasks and languages. The results demonstrate clear improvements in performance metrics such as Phoneme Error Rate (PER), melody adherence (F0-CORR), and subjective evaluations (N-MOS, M-MOS). The comprehensive evaluation across six editing types and two languages provides a robust validation of the model's capabilities, showcasing its strength in maintaining melody while allowing for lyric modifications.
The paper provides sufficient implementation details, including architecture specifications, training protocols, and datasets used. The authors have made their code, model weights, and benchmark publicly available, which enhances reproducibility. However, the complexity of the model and the specific configurations used may still pose challenges for replication without additional guidance.
One limitation noted is the potential for increased phoneme error rates when the model is tasked with generating significantly altered phoneme sequences while preserving melody. Additionally, while the model shows promise, its performance may vary with different singing techniques or languages not covered in the training data. The reliance on large-scale singing data also raises questions about the generalizability of the model to diverse vocal styles.
The implications of YingMusic-Singer are substantial, as it opens avenues for practical applications in music production, personalized music generation, and cross-lingual adaptations. The ability to modify lyrics while preserving melody could revolutionize how artists approach song covers and adaptations, making the technology accessible to a broader audience. Furthermore, the introduction of a benchmark for lyric editing could stimulate further research in the field of singing voice synthesis.
Multi-channel speech enhancement aims to recover clean speech from noisy multi-channel recordings. Most deep learning methods employ discriminative training, which can lead to non-linear distortions from regression-based objectives, especially under challenging environmental noise conditions. Inspired by ArrayDPS for unsupervised multi-channel source separation, we introduce ArrayDPS-Refine, a method designed to enhance the outputs of discriminative models using a clean speech diffusion prior. ArrayDPS-Refine is training-free, generative, and array-agnostic. It first estimates the noise spatial covariance matrix (SCM) from the enhanced speech produced by a discriminative model, then uses this estimated noise SCM for diffusion posterior sampling. This approach allows direct refinement of any discriminative model's output without retraining. Our results show that ArrayDPS-Refine consistently improves the performance of various discriminative models, including state-of-the-art waveform and STFT domain models. Audio demos are provided at https://xzwy.github.io/ArrayDPSRefineDemo/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ArrayDPS-Refine, a novel generative refinement method for multi-channel speech enhancement that significantly improves the performance of existing discriminative models without retraining. This work represents a meaningful advancement in the field of audio processing, particularly in enhancing speech intelligibility and quality in challenging acoustic environments.
The methodology presented in the paper is innovative, introducing ArrayDPS-Refine as a generative refinement technique that enhances outputs from discriminative models without the need for retraining. This is achieved through the estimation of the noise spatial covariance matrix (SCM) and the application of diffusion posterior sampling. The approach is well-structured, leveraging existing techniques in multi-channel speech enhancement while addressing the limitations of previous methods. The training-free aspect is particularly noteworthy, as it allows for flexibility across different models and configurations.
The experimental evaluation is comprehensive, demonstrating the effectiveness of ArrayDPS-Refine across various discriminative models, including state-of-the-art techniques. The use of multiple metrics such as STOI, eSTOI, PESQ, SI-SDR, and WER provides a robust framework for assessing performance improvements. The results indicate significant enhancements in intelligibility and perceptual quality, validating the proposed method's effectiveness. However, the paper could benefit from more detailed comparisons with baseline models to further contextualize the improvements.
The paper provides a detailed account of the experimental setup, including configurations for the diffusion model and the datasets used. However, the lack of a publicly available code repository limits reproducibility. Future work should consider releasing the code and models to facilitate validation of results by the community.
One limitation of the proposed method is its reliance on the quality of the initial discriminative model outputs. If the initial outputs are significantly distorted, the refinement process may not yield optimal results. Additionally, the method's performance in highly complex noise environments or with multiple overlapping speakers remains to be fully explored.
The implications of this work are significant for applications in speech recognition, telecommunications, and assistive technologies. By improving speech enhancement techniques, the proposed method could enhance communication in noisy environments, benefiting users in various real-world scenarios. The training-free nature of the method also suggests potential for broader adoption across different devices and applications.
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, most existing methods apply supervision only at the final classification layer, limiting the discriminative power of intermediate representations. In this work, we propose Crab (Contrastive Representation and Multimodal Aligned Bottleneck), a bimodal Cross-Modal Transformer architecture that integrates speech representations from WavLM and textual representations from RoBERTa, together with a novel Multi Layer Contrastive Supervision (MLCS) strategy. MLCS injects multi-positive contrastive learning signals at multiple layers of the network, encouraging emotionally discriminative representations throughout the model without introducing additional parameters at inference time. To further address data imbalance, we adopt weighted cross-entropy during training. We evaluate the proposed approach on three benchmark datasets covering different degrees of emotional naturalness: IEMOCAP, MELD, and MSP-Podcast 2.0. Experimental results demonstrate that Crab consistently outperforms strong unimodal and multimodal baselines across all datasets, with particularly large gains under naturalistic and highly imbalanced conditions. These findings highlight the effectiveness of Multi Layer Contrastive Supervision as a general and robust strategy for SER. Official implementation can be found in https://github.com/AI-Unicamp/Crab.
Primary: Universidade Estadual de Campinas (UNICAMP)
All Institutions: Universidade Estadual de Campinas (UNICAMP), MCTI, CAPES, FAPESP
The paper presents Crab, a multimodal SER framework that effectively integrates speech and text representations through a novel contrastive learning strategy, achieving significant performance improvements in emotion recognition tasks. The innovative approach and rigorous evaluation contribute meaningfully to the field of speech emotion recognition, addressing key challenges in real-world applications.
The proposed methodology introduces a novel Cross-Modal Transformer architecture, integrating speech and text representations while employing Multi Layer Contrastive Supervision (MLCS) to enhance emotion recognition. This approach is innovative as it applies contrastive learning at multiple layers, which is not common in existing SER frameworks. The use of weighted cross-entropy to address class imbalance further strengthens the methodology, making it robust for real-world applications.
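The multi-layer supervision scheme can be sketched as a multi-positive supervised contrastive term applied at several intermediate layers on top of weighted cross-entropy (the `alpha` weight and the choice of supervised layers are assumptions; the contrastive form follows the standard SupCon formulation rather than the paper's exact loss):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Multi-positive supervised contrastive loss at one layer: every
    same-label pair in the batch is a positive. z: (N, D), labels: (N,)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                   # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(z), dtype=bool)
    per_anchor = -np.where(pos, logp, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return per_anchor[pos.any(1)].mean()

def mlcs_loss(layer_embeddings, labels, weighted_ce, alpha=0.1):
    """MLCS-style objective: weighted cross-entropy plus a contrastive term
    at each supervised intermediate layer."""
    return weighted_ce + alpha * sum(supcon_loss(z, labels) for z in layer_embeddings)
```

Since the extra terms act only on training-time embeddings, inference incurs no additional parameters, consistent with the claim above.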
The experimental evaluation is comprehensive, utilizing three benchmark datasets (IEMOCAP, MELD, and MSP-Podcast 2.0) that vary in emotional naturalness. The results consistently demonstrate superior performance of the Crab model compared to strong baselines, particularly in challenging naturalistic scenarios with class imbalance. The use of multiple evaluation metrics (UAR and WAR) provides a well-rounded assessment of model performance.
The paper includes a link to the official implementation on GitHub, which is crucial for reproducibility. However, specific implementation details such as hyperparameters and training configurations could be more explicitly stated to facilitate easier replication by other researchers.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of emotional expressions in real-world scenarios. Additionally, while the model shows robustness to class imbalance, the performance on unseen speakers in naturalistic conditions could be further explored.
The findings have significant implications for applications in human-computer interaction, customer service, and online education, where understanding emotional cues can enhance user experience. The proposed model's ability to handle class imbalance makes it particularly valuable for deploying SER systems in real-world contexts.
We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes content generated by ten popular AI music generation systems. To prevent shortcut learning and promote robust generalization, the dataset is deliberately constructed to be challenging, enforcing semantic-level alignment between spoofed audio and bona fide references. This alignment is achieved by conditioning generated audio samples directly on bona-fide waveforms or song descriptors. We evaluate Echoes in a cross-dataset setting against three existing AI-generated music datasets using state-of-the-art Wav2Vec2 XLS-R 2B representations. Results show that (i) Echoes is the hardest in-domain dataset; (ii) detectors trained on existing datasets transfer poorly to Echoes; (iii) training on Echoes yields the strongest generalization performance. These findings suggest that provider diversity and semantic alignment help learn more transferable detection cues.
Primary: National University of Science and Technology POLITEHNICA Bucharest
All Institutions: Fraunhofer AISEC, National University of Science and Technology POLITEHNICA Bucharest
The paper presents Echoes, a semantically-aligned dataset for AI-generated music detection, which significantly enhances the benchmarking landscape for music deepfake detection by addressing key challenges in data diversity and shortcut learning.
The methodology is robust, focusing on generating a diverse dataset that emphasizes semantic alignment between real and AI-generated music. The use of LLMs to derive song descriptors for conditioning the generation process is innovative and addresses the challenges of shortcut learning in deepfake detection. The dataset's design, including the variety of music genres and the inclusion of multiple AI music generation systems, enhances its applicability and relevance.
The experimental evaluation is thorough, demonstrating the dataset's effectiveness through cross-dataset testing. The results highlight the difficulty of the Echoes dataset and its ability to promote generalization in detection models. The use of Wav2Vec2 XLS-R 2B for feature extraction and the evaluation metrics employed (EER) are appropriate for the task at hand.
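The EER used in these cross-dataset comparisons can be computed from raw detector scores as follows (a standard sketch; the convention that higher scores mean bona fide is an assumption):

```python
import numpy as np

def equal_error_rate(bona_scores, spoof_scores):
    """EER: the operating point where the false-acceptance rate on spoofed
    audio equals the false-rejection rate on bona fide audio."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona_scores < t).mean() for t in thresholds])    # bona fide rejected
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```

A perfectly separating detector yields an EER of 0, while chance-level scores approach 0.5; cross-dataset EER gaps are what reveal the poor transfer reported above.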
The paper provides sufficient details about the dataset generation process, including the selection of bona fide tracks and the conditioning methods used for AI-generated samples. However, the lack of a direct link to the code or model training details limits full reproducibility.
One limitation is the reliance on a specific set of AI music generation systems, which may not encompass the full spectrum of current technologies. Additionally, the dataset may not cover all possible music genres or styles, potentially limiting its generalizability. The paper also mentions that future work will explore more complex scenarios, indicating that the current evaluation may not fully capture real-world conditions.
The dataset has significant implications for the music industry, particularly in addressing the challenges posed by AI-generated music. By providing a benchmark for deepfake detection, it can help improve the integrity of music platforms and support the development of more reliable detection systems. This work also opens avenues for further research in audio forensics and the ethical implications of AI in creative fields.
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSRHuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSRHuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements that were developed for HuBERT can apply directly.
Primary: Tianjin University
All Institutions: Tianjin University, Chinese Academy of Sciences, Huiyan Technology Company, Tianjin Key Laboratory of Cognitive Computing and Application
This paper presents MSRHuBERT, a self-supervised learning framework that effectively addresses the resolution mismatch problem in speech processing across multiple sampling rates. The technical contributions are substantial, with a well-justified methodology and promising experimental results, positioning this work as a notable advancement in the field of audio machine learning.
The proposed MSRHuBERT method introduces a novel multi-sampling-rate adaptive downsampling CNN that effectively addresses the resolution mismatch problem in self-supervised speech learning. By allowing the model to process audio at various sampling rates without resampling, it preserves high-frequency information critical for tasks like speech reconstruction while maintaining low-frequency semantic content for ASR. The methodology is well-structured, retaining the core elements of the HuBERT framework, which facilitates the integration of existing improvements and analyses. The approach is theoretically sound and presents a clear advancement over existing methods.
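The "shared temporal resolution without resampling" idea can be sketched as a per-rate choice of front-end conv strides (the 16 kHz strides match HuBERT's standard feature extractor; scaling only the first stride, as below, is a hypothetical scheme for illustration):

```python
def total_stride(strides):
    out = 1
    for s in strides:
        out *= s
    return out

def branch_strides(sample_rate_hz, target_hz=50, tail=(2, 2, 2, 2, 2, 2)):
    """Scale only the first conv stride so every input rate maps onto the
    same 50 Hz frame grid that the shared Transformer encoder expects."""
    tail_stride = total_stride(tail)                  # 64 for HuBERT's tail
    first, rem = divmod(sample_rate_hz, target_hz * tail_stride)
    assert rem == 0, "sample rate must land exactly on the target frame grid"
    return (first,) + tail

# HuBERT's 16 kHz front end: (5, 2, 2, 2, 2, 2, 2) -> 320 samples/frame -> 50 Hz
# A 48 kHz branch only needs a wider first stride: (15, 2, 2, 2, 2, 2, 2) -> 960
```

Because every branch lands on the same frame grid, the downstream mask-prediction objective and Transformer encoder can be shared across rates unchanged.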
The experiments conducted span multiple sampling rates (16 kHz to 48 kHz) and evaluate both ASR and full-band speech reconstruction tasks. The results demonstrate that MSRHuBERT outperforms the baseline HuBERT model across various metrics, showcasing its effectiveness in preserving high-frequency details while maintaining low-frequency content. The use of diverse datasets and the systematic evaluation of performance across different sampling rates strengthens the findings. However, the paper could benefit from additional comparative analyses with other state-of-the-art models beyond HuBERT.
The paper provides a detailed description of the experimental setup, including the datasets used and the training configurations. However, the absence of a publicly available code repository or demo URL limits reproducibility. Future work should consider releasing the model and code to facilitate validation by the research community.
One limitation is the reliance on the HuBERT architecture, which may restrict the generalizability of the proposed method to other architectures. Additionally, while the paper addresses the resolution mismatch problem, it does not explore the implications of using the model in real-world applications where sampling rates may vary dynamically. The paper could also expand on potential computational costs associated with the multi-sampling-rate adaptive downsampling CNN.
The implications of this research are significant for the field of speech processing, particularly in applications requiring robust performance across varying audio qualities and sampling rates. The ability to handle mixed-rate data without loss of information can enhance the usability of speech models in diverse environments, potentially leading to improvements in voice recognition systems, virtual assistants, and other audio applications.
This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable understanding of complex acoustic scenes, their performance depends on the semantic richness of the underlying audio encoder representations. This challenge addresses the integration gap by providing a unified generative evaluation framework, XARES-LLM, which assesses submitted encoders across a diverse suite of downstream classification and generation tasks. By decoupling encoder development from LLM fine-tuning, the challenge establishes a standardized protocol for general-purpose audio representations that can effectively be used for the next generation of multimodal language models.
Primary: University of Surrey
All Institutions: University of Surrey, DataOcean AI Inc
The paper presents a novel benchmark for evaluating audio encoders in the context of LALMs, contributing to the advancement of multimodal machine learning. The introduction of XARES-LLM and the structured challenge framework represent significant steps forward in the evaluation of audio representations, with implications for future research and applications in audio understanding.
The paper introduces the Interspeech 2026 Audio Encoder Capability Challenge, which is a well-structured benchmark for evaluating audio encoders in the context of Large Audio Language Models (LALMs). The proposed methodology, XARES-LLM, effectively decouples encoder development from LLM fine-tuning, allowing for a more focused evaluation of audio representations. The challenge's design, which includes multiple tracks and a unified generative evaluation framework, is innovative and addresses existing gaps in the evaluation of audio encoders. The use of a single decoder model for diverse tasks is a significant methodological advancement.
The experiments conducted across four tracks provide a comprehensive assessment of the performance of various audio encoders. The inclusion of both public and hidden test sets enhances the robustness of the evaluation. The results indicate a clear performance advantage for encoders that leverage LALM alignment, showcasing the effectiveness of the proposed evaluation framework. The leaderboard results are well-documented, providing insights into the strengths of different approaches.
The paper emphasizes reproducibility by detailing the experimental setup, including the use of fixed random seeds and multiple hardware configurations. However, the absence of a publicly accessible code repository limits external validation of the results.
One notable limitation is the reliance on proprietary audio encoders, which may restrict the generalizability of findings to publicly available models. Additionally, while the challenge addresses various tasks, the focus on generative outputs may not fully encompass all aspects of audio understanding.
The challenge has the potential to significantly advance the field of audio processing by establishing a standardized protocol for evaluating audio encoders. This could lead to improved performance in multimodal language models and broader applications in areas such as speech recognition, emotion detection, and audio classification.
Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.
Primary: Emory University
All Institutions: Emory University
The main contribution of this paper is the introduction of ZS-Fuse, a novel late-fusion method that combines zero-shot predictions from Audio-Language Models with specialist Foundation Models to improve Speech Emotion Recognition performance. This work showcases a promising direction in leveraging multimodal learning for enhanced emotion recognition, addressing both practical applications and theoretical advancements in the field.
The paper introduces ZS-Fuse, a late-fusion method that effectively combines zero-shot emotion estimates from dual-encoder Audio-Language Models (ALMs) with domain-specialist Foundation Models (FMs). The methodology is well-structured, employing prompt amplification and a simple prompt ensemble to enhance the zero-shot capabilities of ALMs. The choice of dual-encoder models is justified, and the approach to handle ambiguity in emotions is innovative, though the simplicity of the prompt engineering could be seen as a limitation in exploring more complex interactions.
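The late-fusion step can be sketched in a few lines. The following is our own illustration: the prompt-ensemble averaging, the fixed fusion weight `alpha`, and all names are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def zs_fuse(alm_logits_per_prompt, fm_logits, alpha=0.5):
    """Hypothetical sketch of ZS-Fuse-style late fusion.

    alm_logits_per_prompt: (n_prompts, n_classes) zero-shot similarity
        scores from a dual-encoder ALM, one row per prompt in the ensemble.
    fm_logits: (n_classes,) logits from a specialist foundation model.
    alpha: fusion weight (an assumption; the paper's weighting may differ).
    """
    # Prompt ensemble: average probabilities over prompt variants.
    alm_probs = softmax(np.asarray(alm_logits_per_prompt)).mean(axis=0)
    fm_probs = softmax(np.asarray(fm_logits))
    # Late fusion: convex combination of the two probability estimates.
    return alpha * alm_probs + (1 - alpha) * fm_probs
```

A confident specialist can override an uncertain zero-shot estimate, and vice versa, which is the intuition behind combining the two.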
The experiments are comprehensive, evaluating the proposed method across three datasets (RAVDESS, MSP-Podcast, and IEMOCAP) and multiple ALM and FM combinations. The results demonstrate significant improvements over state-of-the-art baselines, particularly with the CLSP model. However, the paper could benefit from more detailed statistical analysis and discussion of the results, such as confidence intervals or significance testing.
The paper provides sufficient details regarding the training process, including the choice of optimizers, batch sizes, and the number of epochs. However, the lack of a public repository or demo URL limits the reproducibility of the results, as external researchers cannot easily validate the findings or replicate the experiments.
One major limitation is the reliance on prompt amplification, which can lead to unpredictable performance, as indicated by the results showing that some configurations degrade performance. Additionally, the paper does not explore the implications of using larger or more complex prompt ensembles, which could enhance the results further.
The proposed method has significant implications for the development of emotion-aware systems, such as empathetic virtual assistants and customer service applications. The integration of ALMs with FMs could lead to advancements in various fields, including mental health monitoring and interactive dialogue systems.
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.
Primary: University of Cambridge
All Institutions: University of Cambridge, Electronic Information School, Flower Labs, University of Auckland, University of Melbourne, Wuhan University
This paper presents a novel adaptive federated fine-tuning framework that effectively addresses the challenges of heterogeneous environments in self-supervised speech representation learning. The technical contributions, particularly in the areas of early exits and layer-wise aggregation, represent a meaningful advancement in the field of federated learning for audio applications.
The proposed adaptive federated fine-tuning framework introduces innovative mechanisms such as early exits and layer-wise partial aggregation, which effectively address the challenges posed by heterogeneity in federated learning environments. The methodology is well-structured, leveraging an elastic multi-branch architecture that allows clients to dynamically select their training depth based on local resources and task complexity. This approach not only enhances computational efficiency but also maintains performance across diverse speech tasks. The integration of lightweight prediction heads and depth-aware aggregation strategies is a significant advancement in federated learning for speech applications.
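A toy version of the early-exit forward pass and the depth-aware partial aggregation might look as follows; the shapes, the `tanh` backbone, and the fallback for untrained layers are illustrative assumptions only, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_CLASSES, N_LAYERS = 8, 4, 6

# Backbone layers plus a lightweight prediction head after every layer
# (the "early exits").
layers = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N_LAYERS)]
heads = [rng.standard_normal((DIM, N_CLASSES)) * 0.1 for _ in range(N_LAYERS)]

def forward(x, exit_depth):
    """Run only the first `exit_depth` layers, then the matching head,
    so a constrained client skips the deeper (more expensive) layers."""
    h = x
    for w in layers[:exit_depth]:
        h = np.tanh(h @ w)
    return h @ heads[exit_depth - 1]

def depth_aware_aggregate(client_updates):
    """Layer-wise partial aggregation: average each layer only over the
    clients whose chosen exit depth actually reached (and trained) it;
    layers no client reached keep the current global weights."""
    agg = []
    for li in range(N_LAYERS):
        contribs = [u["layers"][li] for u in client_updates
                    if u["depth"] > li]
        agg.append(np.mean(contribs, axis=0) if contribs else layers[li])
    return agg
```

Deep layers are thus averaged over fewer, more capable clients, while shallow layers benefit from every participant's update.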
The experiments are comprehensive, covering five diverse downstream tasks that span various aspects of speech understanding. The results demonstrate the effectiveness of the proposed framework in reducing computational overhead while achieving competitive performance compared to centralized training. The evaluation metrics used, including word error rates and classification error rates, are appropriate for the tasks at hand. However, the paper could benefit from additional comparisons with existing state-of-the-art methods to further contextualize the results.
The paper provides a detailed description of the experimental setup, including datasets, model architectures, and training configurations, which aids reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the experiments. Including a link to the implementation would significantly enhance reproducibility.
One limitation is the reliance on a specific backbone model (Wav2Vec 2.0), which may not generalize to all speech tasks or architectures. Additionally, while the framework addresses resource constraints, it does not fully explore the implications of data heterogeneity beyond the basic partitioning strategy employed. The paper could also discuss potential trade-offs between performance and computational efficiency in more detail.
The proposed framework has significant implications for deploying speech recognition systems in privacy-sensitive environments, such as mobile devices and personal assistants. By enabling efficient fine-tuning without compromising user data privacy, this work contributes to the growing field of privacy-preserving machine learning. The methodology could be adapted to other domains where federated learning is applicable, potentially influencing future research in decentralized learning systems.
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from variational auto-encoders (VAEs). We validate our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve parameter-efficient, high-performance training of DiT-Flow that is robust to multiple distortions, using only 4.9% of the total parameters while obtaining better performance on five unseen distortions.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Technion Israel Institute of Technology, University of Haifa
The main contribution of this paper is the development of DiT-Flow, a novel speech enhancement framework that effectively utilizes flow matching and latent representations to improve robustness against multiple distortions. This work represents a significant step forward in the field of audio processing, addressing common challenges faced in real-world applications and demonstrating the potential for future advancements in speech enhancement technologies.
The methodology of DiT-Flow is robust, leveraging flow matching and latent Diffusion Transformers to enhance speech under multiple distortions. The integration of LoRA with the Mixture-of-Experts framework is particularly innovative, allowing for parameter-efficient adaptation to varying acoustic conditions. The use of a synthetic dataset, StillSonicSet, designed to simulate realistic conditions, further strengthens the approach. However, the paper could benefit from clearer descriptions of hyperparameter choices and training procedures.
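To make the parameter-efficiency argument concrete, here is a generic LoRA-style adapter sketch; the MoE routing is omitted, and all shapes and names are ours, not DiT-Flow's.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK = 16, 16, 2   # rank-2 adapter: 2*(16+16)=64 trainable
                                # params vs 256 frozen ones; at full model
                                # scale the ratio shrinks far below this
                                # (the paper reports 4.9%)

W_frozen = rng.standard_normal((D_IN, D_OUT))   # pretrained, never updated
A = rng.standard_normal((D_IN, RANK)) * 0.01    # trainable down-projection
B = np.zeros((RANK, D_OUT))                     # trainable up-projection,
                                                # zero-init so the adapter
                                                # starts as a no-op

def lora_forward(x, scale=1.0):
    """Frozen path plus a low-rank trainable correction x @ A @ B."""
    return x @ W_frozen + scale * (x @ A @ B)
```

Because `B` is zero-initialized, fine-tuning starts exactly from the pretrained behavior; only `A` and `B` receive gradients, keeping the trainable fraction small.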
The experiments are comprehensive, validating DiT-Flow against state-of-the-art models across various conditions. The use of multiple evaluation metrics, including PESQ, ESTOI, and DNSMOS, provides a well-rounded assessment of performance. The results demonstrate significant improvements over baseline models, particularly in challenging scenarios, indicating the effectiveness of the proposed methods. However, the paper lacks detailed comparisons with a broader range of existing methods, which could provide more context for its contributions.
The paper includes sufficient detail regarding the model architecture and training process, but lacks a clear link to code or datasets, which hampers reproducibility. Providing access to the StillSonicSet dataset and the trained models would enhance reproducibility and facilitate further research.
One limitation is the reliance on synthetic data, which may not fully capture the complexities of real-world audio environments. Additionally, while the model shows robustness to multiple distortions, its performance in extreme or novel conditions remains to be tested. The computational efficiency of the model in real-time applications also needs further exploration.
The advancements in speech enhancement presented in this paper have significant implications for real-world applications, particularly in telecommunication, virtual meetings, and assistive technologies. The ability to enhance speech quality in diverse acoustic environments can improve communication clarity and accessibility for users in various settings.
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
Primary: Institute of Science Tokyo
All Institutions: Institute of Science Tokyo, The University of Osaka, The University of Tokyo
The main contribution of this paper is the introduction of AnimalCLAP, a taxonomy-aware language-audio pretraining framework that significantly improves species recognition and trait inference from animal vocalizations. This work represents a meaningful advancement in the application of machine learning to ecological monitoring, with a robust methodology and promising results that could influence future research and practices in wildlife assessment.
The methodology presented in AnimalCLAP is innovative, leveraging a taxonomy-aware framework that integrates hierarchical biological information into the model's training process. The authors introduce a substantial dataset of animal vocalizations, which is a critical asset for training and evaluating the model. The alignment of audio and textual representations through taxonomic structures is a novel approach that enhances the model's ability to generalize to unseen species, which is a significant challenge in the field. The use of contrastive learning techniques is well-justified and effectively applied to the task of species recognition and trait inference.
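The audio-text alignment objective is presumably a CLAP-style symmetric contrastive loss over paired embeddings; the following is a generic sketch of that loss, not AnimalCLAP's taxonomy-weighted variant.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric CLAP-style contrastive loss over a batch of paired
    audio/text embeddings (rows i of each matrix are a matched pair)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (batch, batch) similarities
    n = len(logits)
    idx = np.arange(n)
    # Matched pairs sit on the diagonal; pull them together in both
    # the audio-to-text and text-to-audio directions.
    loss_a2t = -np.mean(log_softmax(logits)[idx, idx])
    loss_t2a = -np.mean(log_softmax(logits.T)[idx, idx])
    return (loss_a2t + loss_t2a) / 2
```

Taxonomy awareness would modify this objective, e.g. by softening the penalty for confusions within the same genus or family, but that weighting is specific to the paper.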
The experiments are comprehensive, utilizing a large dataset of 4,225 hours of recordings from 6,823 species, which is a considerable contribution to the field. The results demonstrate that AnimalCLAP outperforms existing models, including CLAP, in recognizing unseen species and inferring ecological traits. The evaluation metrics used are appropriate, and the authors provide a clear comparison of their model's performance against baseline methods, showcasing the effectiveness of their approach.
The authors commit to making their dataset, code, and models publicly available, which is crucial for reproducibility. However, the paper would benefit from a more detailed description of the experimental setup, including hyperparameter settings and training procedures, to facilitate replication by other researchers.
One limitation of the study is the potential bias in the dataset, which may not cover all ecological contexts or species diversity adequately. Additionally, the model's performance on edge cases or species with very similar vocalizations may not be thoroughly addressed. The reliance on taxonomic structures may also limit the model's applicability in more complex ecological scenarios where such hierarchies are not well defined.
The implications of this research are significant for wildlife conservation and ecological monitoring, as it provides a tool for non-invasive species identification and trait inference from vocalizations. This could enhance biodiversity assessments and inform conservation strategies. The methodology could also be adapted for other domains where audio classification is relevant, such as environmental monitoring or even human-related vocalizations.
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology
The main contribution of this paper is the introduction of a novel weakly supervised framework for audio-visual semantic segmentation that effectively aligns audio and visual features without requiring dense annotations. This work represents a significant step forward in the field of audio-visual understanding, providing a robust methodology and promising results that could influence future research and applications.
The methodology presented in this paper is innovative, particularly in its decomposition of the WSAVSS task into three distinct phases: looking, listening, and segmentation. The introduction of Temporal Visual Prompting (TVP) to enhance audio understanding through visual cues is a novel approach that leverages the inherent relationships between audio and visual modalities. The Progressive Cross-modal Alignment for Semantics (PCAS) framework, which combines instance-wise and token-wise contrastive learning, is well-conceived and addresses the challenge of aligning audio and visual features without requiring dense annotations. This progressive alignment strategy is a significant advancement over existing methods, making it a valuable contribution to the field.
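Token-wise alignment of an audio category to image regions can be illustrated with a toy similarity map; this is our own sketch, and PCAS's actual progressive scheme is considerably more elaborate.

```python
import numpy as np

def audio_to_region_map(audio_cls_emb, visual_tokens, temperature=0.1):
    """Score each visual token (image region) against an audio class
    embedding, yielding a soft segmentation map with no mask labels.
    All names and the softmax normalization are illustrative choices."""
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=-1, keepdims=True)
    a = audio_cls_emb / np.linalg.norm(audio_cls_emb)
    scores = v @ a / temperature              # (n_tokens,) similarities
    scores = scores - scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                              # soft map over regions
```

Regions whose visual tokens align with the sounding category receive high mass, which is the weak-supervision substitute for a per-pixel mask.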
The experiments are comprehensive, demonstrating the effectiveness of the proposed method through comparisons with both weakly supervised and fully supervised baselines. The use of multiple datasets and the reporting of mean IoU and F-score metrics provide a robust evaluation of the model's performance. The ablation studies effectively highlight the contributions of each module within the proposed framework, reinforcing the claims of improved performance. However, the absence of a demo or project URL limits the accessibility of the results for further validation by the community.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or dataset access instructions. The absence of these resources may hinder reproducibility. Clearer guidelines on how to replicate the experiments would enhance the paper's impact.
One limitation of the study is the reliance on video-level labels, which, while reducing annotation costs, may not capture the full complexity of audio-visual interactions. Additionally, the paper does not address potential biases in the datasets used, which could affect the generalizability of the results. The performance on more complex scenes with overlapping sounds and visuals could also be explored further.
The proposed WSAVSS framework has significant implications for applications in multimedia content analysis, human-computer interaction, and assistive technologies. By reducing the need for extensive annotations, this research can facilitate advancements in real-time audio-visual processing systems, enhancing accessibility and user experience in various domains. The approach could also inspire further research into weakly supervised learning paradigms across different modalities.
This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglement strategy utilizing Gradient Reversal Layers (GRL) combined with cosine similarity loss to decouple speaker and emotion information. We introduce Multi Positive Contrastive Learning (MPCL) to induce clustered representations of speaker and emotion embeddings based on their respective labels. Furthermore, SelfTTS employs a self-refinement strategy via Self-Augmentation, exploiting the model's voice conversion capabilities to enhance the naturalness of synthesized speech. Experimental results demonstrate that SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.
Primary: Universidade Estadual de Campinas (UNICAMP)
All Institutions: Universidade Estadual de Campinas (UNICAMP)
The main contribution of this paper is the development of SelfTTS, a robust TTS framework that achieves high-quality cross-speaker style transfer through innovative embedding disentanglement and self-refinement strategies. This work represents a meaningful advancement in the field of speech synthesis, addressing key challenges related to emotional expressivity and speaker identity while providing a solid experimental foundation to support its claims.
The paper introduces SelfTTS, a novel TTS framework that effectively decouples speaker and emotion embeddings without relying on external encoders. The methodology employs Gradient Reversal Layers (GRL) and Multi Positive Contrastive Learning (MPCL) to achieve disentanglement and clustering of embeddings, which is a significant advancement over existing methods that often suffer from speaker leakage. The self-refinement strategy through Self-Augmentation is particularly innovative, leveraging the model’s voice conversion capabilities to enhance the naturalness of synthesized speech. This approach is well-justified and clearly articulated, demonstrating a solid understanding of the challenges in TTS.
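The GRL trick is easy to state framework-free: identity in the forward pass, sign-flipped gradient in the backward pass. A minimal sketch follows, together with the cosine-similarity measure the paper pairs it with; the class name and `lam` are our assumptions.

```python
import numpy as np

class GradReverse:
    """Gradient Reversal Layer: features pass through unchanged, but
    gradients flowing back are negated (and scaled by `lam`), so the
    encoder is trained to *remove* the adversary's target information."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                       # identity in the forward pass

    def backward(self, grad_out):
        return -self.lam * grad_out    # adversarial sign flip

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In a SelfTTS-like setup, the emotion embedding would pass through the GRL into a speaker classifier, driving speaker information out of it, while a cosine-similarity loss between the speaker and emotion embeddings further pushes the two representations apart.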
The experimental setup is robust, utilizing both subjective (eMOS, nMOS, sMOS) and objective metrics (UTMOS, WER, SECS, EECS) to evaluate performance. The results indicate that SelfTTS outperforms state-of-the-art models in emotional naturalness and stability, which is a crucial aspect of TTS systems. The use of cross-corpus experiments adds to the credibility of the findings, although the paper could benefit from more extensive comparisons with additional baselines.
The paper provides adequate implementation details, including the architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The authors have made their code publicly available, enhancing the likelihood that other researchers can replicate the results. However, some hyperparameters and specific configurations could be more explicitly detailed to ensure complete clarity.
One limitation noted is the model's performance in cross-corpus scenarios, where emotional adherence is lower due to the differences in recording conditions. Additionally, while the Self-Augmentation strategy shows promise, its effectiveness may vary based on the quality of synthetic samples generated, which could introduce artifacts into the training process.
The advancements presented in SelfTTS have significant implications for the development of expressive TTS systems, particularly in applications requiring emotional expressivity and speaker identity preservation. This work could benefit various fields, including virtual assistants, audiobooks, and gaming, where natural and emotionally engaging speech synthesis is essential.
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find that bias is strongly language-dependent and that multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization (ERM) with an adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
Primary: Kyoto University
All Institutions: Kyoto University, Agency for Science, Technology and Research (A*STAR)
This paper presents a novel benchmark and a fairness-aware training objective for mitigating gender bias in multilingual multimodal speech emotion recognition systems. The technical contributions and methodology are robust, addressing a pressing issue in the field of machine learning and AI.
The proposed methodology, ERM-MinMaxGAP, is a significant advancement in addressing gender bias in multilingual multimodal speech emotion recognition (SER). The integration of empirical risk minimization with a fairness regularization term that focuses on the maximum male-female loss gap is innovative. The adaptive fairness weight mechanism further enhances the robustness of the training process, allowing for dynamic adjustments based on the model's performance. The detailed description of the MinMaxGAP regularizer and its implementation demonstrates a thorough understanding of the complexities involved in SER tasks, particularly in a multilingual context.
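The objective described above can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the function name and grouping keys are hypothetical, and a fixed `lam` stands in for the paper's adaptive fairness weight mechanism.

```python
import numpy as np

def erm_minmaxgap_loss(losses, groups, lam=0.5):
    """Sketch of ERM + MinMaxGAP: standard empirical risk plus a penalty
    on the largest male-female loss gap over (language, modality) strata.

    losses : per-sample losses, length N
    groups : list of (language, modality, gender) tuples, length N
    lam    : fairness weight (fixed here; the paper's schedule is adaptive)
    """
    losses = np.asarray(losses, dtype=float)
    erm = losses.mean()  # standard empirical risk term

    # Mean loss per (language, modality, gender) subgroup.
    means = {}
    for loss, key in zip(losses, groups):
        means.setdefault(key, []).append(loss)
    means = {k: float(np.mean(v)) for k, v in means.items()}

    # Largest absolute male-female gap across all strata.
    gaps = []
    for lang, mod in {(lang, mod) for lang, mod, _ in means}:
        m = means.get((lang, mod, "male"))
        f = means.get((lang, mod, "female"))
        if m is not None and f is not None:
            gaps.append(abs(m - f))
    gap = max(gaps) if gaps else 0.0

    return erm + lam * gap
```

Penalizing only the *maximum* gap (rather than the average) concentrates the regularization pressure on the worst-off language/modality stratum, which matches the MinMaxGAP framing.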
The experimental setup is well-structured, utilizing the MELD-ST dataset to benchmark the proposed method against existing models. The results indicate that ERM-MinMaxGAP not only improves SER performance but also reduces gender disparity effectively across different languages and modalities. The ablation studies provide valuable insights into the contributions of each component of the proposed method, reinforcing the effectiveness of the MinMaxGAP regularization approach.
The paper states that all code, data, and models will be released upon acceptance, which is a positive aspect for reproducibility. In the meantime, specific implementation details regarding the training process, hyperparameters, and dataset preparation are provided, which aid in replicating the experiments. The clarity in methodology and results presentation further supports reproducibility.
One limitation is that while the proposed method shows improvements in SER and fairness, it does not achieve the minimum post-hoc gender gap in every setting, indicating that the approach may not be universally applicable across all datasets or languages. Additionally, the reliance on a specific dataset (MELD-ST) may limit the generalizability of the findings.
The implications of this research are significant, as it addresses a critical issue of fairness in AI systems, particularly in emotion recognition, which has applications in various fields such as mental health assessment, customer service, and human-computer interaction. By improving fairness in SER systems, this work contributes to the development of more equitable AI technologies that can better serve diverse populations.
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
Primary: Peking University
All Institutions: Peking University, The State Key Laboratory of Multimedia Information Processing, The Hong Kong University of Science and Technology (Hong Kong SAR)
The main contribution of this paper is the introduction of SqueezeComposer, a novel framework for long-form music generation that utilizes temporal speed-up to enhance computational efficiency while preserving musical coherence. This work represents a significant advancement in the field of audio generation, addressing key challenges and opening avenues for future research in scalable music composition.
The methodology presented in SqueezeComposer is innovative, leveraging a temporal speed-up approach to address the challenges of long-form music generation. By generating music in an accelerated domain and restoring it to normal speed, the authors effectively reduce computational requirements while maintaining musical coherence. The hierarchical generation paradigm is well-structured, allowing for both abstract and detailed content generation. The use of diffusion models for both generation and refinement is a strong choice, aligning with current trends in audio synthesis. However, the paper could benefit from a more detailed explanation of the implementation specifics and the choice of hyperparameters.
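The speed-up/restore pipeline can be illustrated with a minimal sketch. Naive linear-interpolation resampling and an identity `generate_fn` stand in for the paper's accelerated-domain diffusion model and restored-domain refinement; all names here are hypothetical.

```python
import numpy as np

def speed_up(wave, factor):
    """Time-accelerate a waveform by naive linear-interpolation resampling."""
    n = len(wave)
    idx = np.linspace(0, n - 1, int(round(n / factor)))
    return np.interp(idx, np.arange(n), wave)

def slow_down(wave, target_len):
    """Restore the original duration by upsampling back to target_len."""
    n = len(wave)
    idx = np.linspace(0, n - 1, target_len)
    return np.interp(idx, np.arange(n), wave)

def generate_long_form(wave, factor, generate_fn):
    """Generate in the accelerated domain, then restore the tempo.
    generate_fn stands in for the generation/refinement models."""
    fast = speed_up(wave, factor)        # shorter sequence -> cheaper modeling
    fast_out = generate_fn(fast)         # generation at, e.g., 4x speed
    return slow_down(fast_out, len(wave))  # back to real-time duration
```

The point of the sketch is the resource arithmetic: at a 4x speed-up, the sequence the generator sees is a quarter of the original length, so memory and attention cost shrink accordingly before the output is stretched back.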
The experiments are comprehensive, utilizing a variety of datasets and evaluation metrics, including Fréchet Audio Distance (FAD) and AudioBox-Aesthetics metrics. The results demonstrate that SqueezeComposer outperforms existing methods in terms of generation efficiency and quality, particularly in long-form music generation tasks. The comparison against established baselines is robust, showcasing the framework's effectiveness across different music generation scenarios. However, further qualitative assessments through user studies could enhance the evaluation of generated audio quality.
The paper provides a clear algorithmic description of the SqueezeComposer framework, but it lacks detailed implementation specifics, such as the exact architectures used for the diffusion models and the training process. Including code or a more thorough description of the experimental setup would improve reproducibility.
One limitation is the potential degradation in audio quality when using accelerated audio representations, which could affect the fidelity of the generated music. Additionally, while the framework shows promise for long-form music generation, the scalability to even longer compositions or more complex musical structures is not fully explored. The reliance on existing vocoders without retraining may also limit the potential for achieving the highest audio quality.
SqueezeComposer has the potential to significantly impact the field of music generation by enabling efficient production of long-form compositions, which could be beneficial for various applications in music production, film scoring, and interactive media. The approach could also inspire further research into hierarchical generation techniques and the use of accelerated representations in other domains of generative modeling.
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic clues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
Primary: The University of Osaka
All Institutions: The University of Osaka, The University of Tokyo
The paper presents AcoustEmo, a time-sensitive MLLM that significantly enhances open-vocabulary emotion reasoning by capturing local acoustic dynamics through a novel Utterance-Aware Acoustic Q-Former. This work is a meaningful contribution to the field of multimodal emotion recognition, addressing critical gaps in existing methodologies and demonstrating substantial technical advancements.
The proposed methodology introduces a novel architecture, AcoustEmo, which leverages an Utterance-Aware Acoustic Q-Former to address the limitations of traditional global audio encoders in capturing fine-grained acoustic details. The use of a timestamp-synchronized sliding window for dynamic extraction of segment-level audio tokens is innovative, allowing the model to maintain semantic coherence between audio and text modalities. This approach is well-justified and effectively targets the nuances of emotion conveyed through micro-prosody and intonation shifts, which are critical for accurate emotion recognition.
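A rough sketch of timestamp-synchronized segment-level token extraction follows. The window and hop sizes are assumptions, and simple mean-pooling stands in for the Q-Former's learned queries; none of these names come from the paper.

```python
import numpy as np

def segment_tokens(frames, utterances, frame_rate=50, window_s=1.0, hop_s=0.5):
    """Slice frame-level audio features into timestamp-aligned windows.

    frames     : (T, D) array of audio-encoder features
    utterances : list of (start_s, end_s) utterance timestamps
    Returns one pooled token per sliding window, grouped by utterance.
    """
    win = int(window_s * frame_rate)
    hop = int(hop_s * frame_rate)
    tokens = []
    for start_s, end_s in utterances:
        lo, hi = int(start_s * frame_rate), int(end_s * frame_rate)
        utt_tokens = []
        # max(...) guarantees at least one (possibly short) window per utterance.
        for pos in range(lo, max(lo + 1, hi - win + 1), hop):
            window = frames[pos:pos + win]
            # Mean-pooling stands in for the Q-Former's learned queries.
            utt_tokens.append(window.mean(axis=0))
        tokens.append(np.stack(utt_tokens))
    return tokens
```

The key contrast with a global encoder is visible in the output shape: instead of one vector per clip, each utterance yields a short sequence of tokens whose order preserves the temporal evolution of local prosody.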
The experiments conducted on the Explainable Multimodal Emotion Recognition (EMER) task demonstrate the effectiveness of the proposed model. The paper provides a comprehensive evaluation against multiple baseline models, showcasing significant improvements in performance metrics. The ablation studies further validate the necessity of the proposed components, reinforcing the claims made regarding the importance of local acoustic dynamics and timestamp synchronization.
The paper includes sufficient implementation details, including the architecture, optimization strategies, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository or demo URL makes it harder for other researchers to replicate the results directly.
While the model shows promising results, it occasionally misclassifies ambiguous emotional states, particularly in sarcastic utterances. Additionally, the performance can degrade in low-SNR scenarios due to background noise interference. These limitations highlight areas for future improvement, particularly in enhancing robustness against challenging acoustic conditions.
The advancements presented in AcoustEmo have significant implications for applications in empathetic conversational agents, mental health monitoring, and human-computer interaction. By improving the accuracy of emotion recognition in multimodal contexts, the model can contribute to more socially aware AI systems, enhancing user experiences in various interactive settings.
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
Primary: NAVER Cloud
All Institutions: NAVER Cloud
The main contribution of this paper is the introduction of the SNAP framework, which effectively mitigates speaker entanglement in deepfake detection by employing orthogonal projection techniques to isolate synthesis artifacts. This innovative approach not only achieves state-of-the-art performance but also demonstrates robust generalization capabilities across unseen speakers and TTS models, marking a significant advancement in the field of audio deepfake detection.
The proposed SNAP framework introduces a novel approach to disentangle speaker identity from synthetic speech detection by employing orthogonal projection techniques. This mathematical decomposition of the feature space into speaker-dependent and artifact subspaces is innovative and effectively addresses the identified issue of speaker entanglement. The use of a simple logistic regression classifier on the refined features demonstrates a practical application of the method, emphasizing efficiency without compromising performance.
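The core projection step can be sketched in a few lines. This is a minimal illustration under assumptions: SVD of centered speaker embeddings stands in for SNAP's speaker-subspace estimation, and the function names are hypothetical.

```python
import numpy as np

def speaker_nulling_projection(speaker_embs, k):
    """Estimate a rank-k speaker subspace and return the orthogonal
    projector onto its complement (I - U U^T).

    speaker_embs : (N, D) matrix of speaker-informative embeddings
    """
    X = speaker_embs - speaker_embs.mean(axis=0)   # center the embeddings
    # Top-k right singular vectors span the speaker-dependent directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:k].T                                   # (D, k) orthonormal basis
    D = speaker_embs.shape[1]
    return np.eye(D) - U @ U.T

def null_speaker(features, P):
    """Project detector features into the speaker-nulled residual space."""
    return features @ P.T
```

Because `P` is an orthogonal projector, applying it removes exactly the components lying in the estimated speaker subspace and leaves the residual, artifact-bearing directions untouched.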
The experiments are well-structured, utilizing established datasets such as ASVspoof 2019 and 2021, and the In-the-Wild benchmark. The results show a clear improvement in detection performance, with significant reductions in equal error rates (EER) across various conditions, including unseen speakers and TTS models. The quantitative analysis of speaker entanglement through silhouette scores adds depth to the evaluation, reinforcing the effectiveness of the SNAP method.
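The entanglement analysis mentioned above rests on the silhouette coefficient; a plain-numpy sketch of that metric is shown below (the helper name is ours, and using speaker identity as the cluster label follows the paper's framing: a high score means features still cluster by speaker, i.e. strong entanglement).

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance).

    For each sample: a = mean distance to its own cluster,
    b = smallest mean distance to any other cluster,
    score = (b - a) / max(a, b).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix.
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    scores = []
    for i, lab in enumerate(labels):
        same = (labels == lab)
        same[i] = False                    # exclude the point itself
        if not same.any():
            continue                       # singleton cluster: undefined
        a = d[i][same].mean()
        b = min(d[i][labels == other].mean()
                for other in set(labels.tolist()) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Interpreted this way, a successful speaker-nulling step should push the speaker-labeled silhouette score toward zero while leaving real-vs-synthetic separability intact.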
The paper provides a clear description of the methodology, including feature extraction, subspace projection, and classification processes. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should consider sharing implementation details to facilitate independent validation of results.
While the SNAP framework shows promising results, it primarily focuses on speaker nulling, which may not address other potential confounding factors in deepfake detection. Additionally, the reliance on logistic regression may limit the exploration of more complex models that could further enhance performance. The generalization to unseen TTS models is commendable, but the robustness across all possible variations in synthetic speech generation remains to be fully evaluated.
The implications of this research extend beyond deepfake detection, as the speaker-nulling framework could be applied to other areas of audio processing, such as emotion recognition and speaker-independent speech recognition. The ability to isolate artifacts from speaker identity can enhance the reliability of various speech-related applications, contributing to the development of more secure and trustworthy audio technologies.
Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rate. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of the pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the same bitrate, delivering superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced. Our demo page is available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The main contribution of this paper is the introduction of OmniCodec, a universal neural audio codec that effectively combines low frame rate modeling with semantic-acoustic decoupling, achieving superior reconstruction quality and semantic representation across diverse audio domains. This work significantly advances the state of audio codecs, particularly in their application to large language models and generative tasks, and sets a foundation for future research in audio representation learning.
The methodology presented in this paper is innovative, particularly with its hierarchical multi-codebook design and the semantic-acoustic decoupling approach. The use of a pre-trained understanding model's audio encoder to enhance semantic representation is a novel contribution that addresses the limitations of existing codecs. The self-guidance strategy to improve codebook utilization is also a noteworthy addition, demonstrating a thoughtful approach to enhancing reconstruction quality while maintaining low frame rates.
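A hierarchical multi-codebook design of this kind is typically built on residual vector quantization; the sketch below illustrates that mechanism under assumptions (the function name is ours, and in a semantic-acoustic split mirroring the paper's description, the first codebook would be supervised by a pretrained understanding model's encoder while later books capture acoustic residual detail).

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage.

    x         : (T, D) frame features
    codebooks : list of (K, D) code-vector tables, coarse to fine
    Returns the per-stage code indices and the summed reconstruction.
    """
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # Nearest code vector (squared Euclidean) for each residual frame.
        d = ((residual[:, None] - cb[None]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        q = cb[idx]
        codes.append(idx)
        recon += q          # hierarchical sum: coarse content + fine detail
        residual -= q       # pass what remains to the next codebook
    return codes, recon
```

The hierarchy is what makes low-frame-rate operation plausible: the first (semantic) codebook can be informative on its own for downstream generation, while reconstruction fidelity is recovered additively by the deeper books.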
The experiments are robust, utilizing a comprehensive dataset of approximately 160,000 hours of audio across various domains (speech, music, and general sound). The evaluation metrics are well-chosen, including both objective measures (PESQ, STOI, Mel distance) and subjective assessments (N-MOS, S-MOS). The results indicate that OmniCodec outperforms existing models, particularly in the music and general sound domains, which validates the effectiveness of the proposed architecture.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which facilitates reproducibility. The open-sourcing of the model and code further enhances the potential for other researchers to replicate and build upon this work.
One limitation noted is the performance disparity in the speech domain compared to other models, which may be attributed to the structure of the WavLM model used for semantic supervision. Additionally, the paper acknowledges challenges in achieving optimal semantic decoupling for speech, suggesting that future work will be needed to address these issues.
The proposed OmniCodec has significant implications for audio generation tasks across various domains, including speech synthesis and music generation. Its ability to provide semantically informative representations can enhance applications in multimedia content creation, real-time audio processing, and interactive systems. The open-source nature of the project encourages further exploration and innovation in the field.