Audio ML Papers

Beyond Forced Modality Balance: Intrinsic Information Budgets for Multimodal Learning

Zechang Xiong, Da Li, Kexin Tang ... · ICME 2026

Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or l...

Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or losses. However, they overlook the fact that each modality has finite information capacity. In this work, we propose IIBalance, a multimodal learning framework that aligns the modality contributions with Intrinsic Information Budgets (IIB). We propose a task-grounded estimator of each modality's IIB, transforming its capacity into a global prior over modality contributions. Anchored by the highest-budget modality, we design a prototype-based relative alignment mechanism that corrects semantic drift only when weaker modalities deviate from their budgeted potential, rather than forcing imitation. During inference, we propose a probabilistic gating module that integrates the global budgets with sample-level uncertainty to generate calibrated fusion weights. Experiments on three representative benchmarks demonstrate that IIBalance consistently outperforms state-of-the-art balancing methods and achieves better utilization of complementary modality cues. Our code is available at: https://github.com/XiongZechang/IIBalance.

Institutional Affiliations

Primary: Alibaba Group

All Institutions: Alibaba Group, Beijing Jiaotong University

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of IIBalance, a multimodal learning framework that utilizes Intrinsic Information Budgets to optimize modality contributions, leading to improved performance in scenarios with imbalanced modalities. This work significantly advances the understanding of modality interplay in multimodal systems and offers a practical solution to a common challenge in the field.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel framework, IIBalance, that addresses the issue of modality dominance in multimodal learning by proposing the concept of Intrinsic Information Budgets (IIB). This approach emphasizes the importance of recognizing each modality's information capacity and adapting their contributions accordingly. The methodology is well-structured, with a clear two-stage process that includes prototype-guided relative alignment and uncertainty-aware Bayesian fusion. The use of a dataset-level prior for modality contributions is particularly innovative, allowing for a more nuanced understanding of how different modalities should contribute based on their intrinsic capabilities.

Experimental Evaluation

The experimental validation is robust, employing three representative benchmarks (Kinetics-Sounds, CREMA-D, and AVE) to demonstrate the effectiveness of IIBalance. The results indicate consistent improvements over state-of-the-art methods, showcasing not only higher overall accuracy but also better performance in weaker modalities. The paper provides a thorough analysis of the contributions of various components of the proposed method, reinforcing the value of the IIB concept and its implementation.

Reproducibility

The paper includes sufficient implementation details, such as training procedures, model architectures, and hyperparameter settings, which facilitate reproducibility. The authors have also made their code publicly available, further enhancing the potential for others to replicate and build upon their work.

Limitations

While the proposed method shows promising results, the paper does not extensively discuss the scalability of the approach to more complex multimodal scenarios or its performance in real-world applications. Additionally, the reliance on a fixed IIB prior during training may limit adaptability in dynamic environments where modality reliability can change rapidly.

Broader Impact

The implications of this work extend to various applications in audio-visual recognition, human-computer interaction, and any domain where multimodal data is prevalent. By improving how models leverage complementary information from different modalities, this research could enhance the robustness and accuracy of systems in fields such as robotics, surveillance, and multimedia content analysis. The main contribution of this paper is the introduction of IIBalance, a multimodal learning framework that utilizes Intrinsic Information Budgets to optimize modality contributions, leading to improved performance in scenarios with imbalanced modalities. This work significantly advances the understanding of modality interplay in multimodal systems and offers a practical solution to a common challenge in the field.

Analysis: Full Paper • Full text: 25,643 characters

Multi-Source Evidence Fusion for Audio Question Answering

Aivo Olev, Tanel Alumäe · arXiv

Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, i...

Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin in challenge's reasoning quality metric.

Institutional Affiliations

Primary: Tallinn University of Technology

All Institutions: Tallinn University of Technology

Demo

ML Relevance Analysis (83)

The paper presents a novel multi-source evidence fusion approach for audio question answering, achieving top performance in reasoning quality while addressing the challenges of reliability and transparency in LALMs. The comprehensive methodology and strong experimental results contribute significantly to the field of audio understanding and reasoning, paving the way for future advancements in multimodal AI systems.

Comprehensive Analysis

Methodology Assessment

The paper presents a robust multi-source ensemble pipeline that effectively combines two large audio language models (LALMs) with a tiered reliability framework for acoustic tools. The methodology emphasizes dual-source evidence fusion and a structured contradiction detection mechanism, which enhances the reasoning quality of the system. The approach of grounding in reliability-tagged evidence is innovative and addresses the common issue of hallucination in LALMs, making the reasoning process more transparent and verifiable.

Experimental Evaluation

The evaluation is conducted on the Interspeech 2026 Audio Reasoning Challenge dataset, which is comprehensive and includes a diverse range of audio scenarios. The reported results demonstrate a strong performance, with the system achieving the highest reasoning quality score and competitive accuracy. Ablation studies provide statistical significance to the improvements gained from the dual-source evidence fusion, reinforcing the effectiveness of the proposed methodology.

Reproducibility

The paper provides detailed implementation details, including the models and tools used, which enhances reproducibility. However, the reliance on empirical tuning of reliability weights and confidence caps without a data-driven approach may pose challenges for complete reproducibility in other contexts.

Limitations

The system's end-to-end latency of 8-10 minutes per sample limits its applicability in real-time scenarios. Additionally, while the architecture is well-suited for the challenge, its generalizability to other reasoning tasks remains to be fully validated. The empirical tuning of parameters may also restrict the adaptability of the system to different datasets or tasks.

Broader Impact

The proposed system has significant implications for audio understanding and reasoning, particularly in applications such as automated audio analysis, content moderation, and interactive audio systems. By improving the transparency and reliability of audio question answering, it opens avenues for more trustworthy AI applications in various domains, including education, entertainment, and accessibility. The paper presents a novel multi-source evidence fusion approach for audio question answering, achieving top performance in reasoning quality while addressing the challenges of reliability and transparency in LALMs. The comprehensive methodology and strong experimental results contribute significantly to the field of audio understanding and reasoning, paving the way for future advancements in multimodal AI systems.

Analysis: Full Paper • Full text: 22,137 characters

Making Separation-First Multi-Stream Audio Watermarking Feasible via Joint Training

Houmin Sun, Zi Hu, Linxi Li ... · arXiv

Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework-embedding distinct information into stems u...

Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework-embedding distinct information into stems using unique keys but a shared structure, mixing, separating, and decoding from each output. A naive pipeline (robust watermarking + off-the-shelf separation) yields poor bit recovery, showing robustness to generic distortions does not ensure robustness to separation artifacts. To enable this, we jointly train the watermark system and the separator in an end-to-end manner, encouraging the separator to preserve watermark cues while adapting embedding to separation-specific distortions. Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality.

Institutional Affiliations

Primary: Duke Kunshan University

All Institutions: Duke Kunshan University, The Chinese University of Hong Kong

Demo

ML Relevance Analysis (83)

The paper presents a novel approach to multi-stream audio watermarking that effectively addresses the challenges posed by source separation. By jointly training the watermarking and separation systems, the authors demonstrate substantial improvements in watermark recovery while maintaining audio quality, marking a significant contribution to the field of audio processing and copyright protection.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel separation-first, multi-stream audio watermarking framework that jointly trains a watermarking system and a source separator in an end-to-end manner. This approach addresses the challenge of preserving watermark cues during the separation process, which is often overlooked in traditional watermarking methods. The methodology is well-structured, with a clear problem setup and a detailed description of the joint training pipeline, including the use of a key-conditioned Conformer architecture for watermarking and the Demucs separator for audio separation. The approach is innovative in its integration of watermarking and separation, which is a significant advancement in the field.

Experimental Evaluation

The experiments are comprehensive, utilizing multiple datasets and evaluating the performance of the proposed method against several baselines. The results demonstrate substantial improvements in post-separation watermark recovery, with a significant reduction in bit error rates compared to existing methods. The evaluation metrics used, including average bit error rate and perceptual quality measures (e.g., SNR and ViSQOL), provide a robust assessment of the method's effectiveness. The experiments also highlight the importance of joint training in enhancing both watermark robustness and separation integrity.

Reproducibility

The paper provides sufficient implementation details, including the architecture of the watermarking system and the separation network, as well as the training setup and loss functions. However, the reproducibility could be improved by providing access to the code and detailed instructions for replicating the experiments. The mention of hardware specifications and training duration is helpful, but a public repository would enhance transparency.

Limitations

One limitation is that the framework is currently limited to two-stem mixtures, which may restrict its applicability in more complex audio scenarios. Additionally, while the joint training approach improves robustness, it may introduce trade-offs in terms of the imperceptibility of the watermark, as indicated by the results showing that separation-aware models do not outperform single-carrier baselines in direct encoding/decoding settings.

Broader Impact

The proposed method has significant implications for copyright protection and content authenticity in the age of AI-generated audio. As audio content becomes increasingly mixed and generated from multiple sources, the ability to independently watermark and recover information from different stems is crucial. This research could pave the way for more secure and reliable audio watermarking techniques, potentially influencing industry standards in digital rights management. The paper presents a novel approach to multi-stream audio watermarking that effectively addresses the challenges posed by source separation. By jointly training the watermarking and separation systems, the authors demonstrate substantial improvements in watermark recovery while maintaining audio quality, marking a significant contribution to the field of audio processing and copyright protection.

Analysis: Full Paper • Full text: 20,588 characters

Robust Generative Audio Quality Assessment: Disentangling Quality from Spurious Correlations

Kuan-Tang Huang, Chien-Chun Wang, Cheng-Yeh Yang ... · IEEE ICME 2026

The rapid proliferation of AI-Generated Content (AIGC) has necessitated robust metrics for perceptual quality assessment. However, automatic Mean Opinion Score (MOS) prediction models are often compromised by data scarcity, predisposing them to learn spurious correlations-- such ...

The rapid proliferation of AI-Generated Content (AIGC) has necessitated robust metrics for perceptual quality assessment. However, automatic Mean Opinion Score (MOS) prediction models are often compromised by data scarcity, predisposing them to learn spurious correlations-- such as dataset-specific acoustic signatures-- rather than generalized quality features. To address this, we leverage domain adversarial training (DAT) to disentangle true quality perception from these nuisance factors. Unlike prior works that rely on static domain priors, we systematically investigate domain definition strategies ranging from explicit metadata-driven labels to implicit data-driven clusters. Our findings reveal that there is no "one-size-fits-all" domain definition; instead, the optimal strategy is highly dependent on the specific MOS aspect being evaluated. Experimental results demonstrate that our aspect-specific domain strategy effectively mitigates acoustic biases, significantly improving correlation with human ratings and achieving superior generalization on unseen generative scenarios.

Institutional Affiliations

Primary: National Taiwan Normal University

All Institutions: National Taiwan Normal University, Academia Sinica, E.SUN Financial Holding Co., Ltd., United Link Co., Ltd.

GitHub

ML Relevance Analysis (83)

The paper presents a novel approach to audio quality assessment by leveraging domain adversarial training to disentangle quality perception from spurious correlations, significantly enhancing the reliability of automatic MOS prediction models. The comprehensive methodology and rigorous experimental validation contribute to its significance in the field, addressing a pressing challenge in evaluating AI-generated audio content.

Comprehensive Analysis

Methodology Assessment

The paper introduces a robust framework for Mean Opinion Score (MOS) prediction using Domain Adversarial Training (DAT) to mitigate spurious correlations in audio quality assessment. The methodology is well-structured, employing three distinct domain definition strategies: explicit metadata-driven labels, implicit K-means clustering, and random assignment, which are systematically analyzed for their effectiveness. The use of a pre-trained SSL feature extractor and a MultiGauss backbone for quality prediction adds depth to the approach, ensuring that the model captures intrinsic quality features while remaining invariant to domain-specific biases.

Experimental Evaluation

The experiments are comprehensive, utilizing the AES-Natural dataset with a well-defined split protocol for training, validation, and evaluation. The results demonstrate significant improvements in correlation with human ratings across various aspects of audio quality, showcasing the effectiveness of the proposed domain strategies. The statistical significance of the results is validated through rigorous testing, including t-tests, which strengthens the findings.

Reproducibility

The paper provides sufficient details regarding the model architecture, training setup, and evaluation metrics, which facilitates reproducibility. The inclusion of a GitHub repository for code access further enhances the potential for others to replicate the study.

Limitations

While the study presents a robust framework, it may be limited by the dataset used, which could affect the generalizability of the findings. Additionally, the reliance on specific domain definitions may not universally apply to all audio quality assessment scenarios, suggesting a need for further exploration of domain strategies across diverse datasets.

Broader Impact

The implications of this research are significant, as it addresses a critical challenge in the evaluation of AI-generated audio content, which is increasingly relevant in the context of content creation and multimedia applications. The findings could influence the development of more reliable audio quality assessment tools, potentially impacting industries such as entertainment, broadcasting, and AI content generation. The paper presents a novel approach to audio quality assessment by leveraging domain adversarial training to disentangle quality perception from spurious correlations, significantly enhancing the reliability of automatic MOS prediction models. The comprehensive methodology and rigorous experimental validation contribute to its significance in the field, addressing a pressing challenge in evaluating AI-generated audio content.

Analysis: Full Paper • Full text: 27,289 characters

Shared Representation Learning for Reference-Guided Targeted Sound Detection

Shubham Gupta, Adarsh Arigala, B. R. Dilleswari ... · IEEE ICASSP 2026

Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a ref...

Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches, rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a mixture encoder, jointly optimized with a multi-task learning approach. In this work, we propose a unified encoder architecture that processes both the reference and mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our method achieves substantial improvements over prior approaches, surpassing existing methods and establishing a new state-of-the-art benchmark for targeted sound detection, with a segment-level F1 score of 83.15% and an overall accuracy of 95.17% on the URBAN-SED dataset.

Institutional Affiliations

Primary: Indian Institute of Technology Hyderabad

All Institutions: Indian Institute of Technology Hyderabad

ML Relevance Analysis (83)

The paper presents a unified encoder framework for reference-guided targeted sound detection, achieving state-of-the-art performance and demonstrating robustness in real-world applications. The methodology and results contribute meaningfully to the field of audio machine learning, particularly in enhancing sound event detection capabilities.

Comprehensive Analysis

Methodology Assessment

The proposed methodology introduces a unified encoder architecture that processes both reference and mixture audio in a shared representation space. This approach reduces architectural complexity and enhances feature alignment, which is a significant improvement over previous dual-branch designs. The methodology is well-structured, leveraging ConvNeXt for representation extraction and employing diverse fusion strategies, which are systematically evaluated. The multi-task learning paradigm further strengthens the model's performance by combining clip-level classification with frame-level detection, showcasing a comprehensive understanding of the task requirements.

Experimental Evaluation

The experimental evaluation is robust, utilizing well-defined datasets (URBAN-SED and UrbanSound8K) and establishing new benchmarks for performance metrics, particularly segment-level F1 scores. The results demonstrate substantial improvements over prior methods, indicating the effectiveness of the proposed approach. The evaluation also includes cross-domain generalization tests, which add depth to the findings and confirm the model's resilience to distributional shifts.

Reproducibility

The paper provides sufficient implementation details, including architecture specifications, training configurations, and data augmentation strategies, which facilitate reproducibility. However, the absence of a public code repository or demo URL limits the ease of reproduction for external researchers.

Limitations

One identified limitation is the reliance on specific datasets, which may not fully represent the diversity of real-world acoustic environments. Additionally, while the model shows strong performance on the benchmark datasets, its effectiveness in more complex or noisy real-world scenarios remains to be thoroughly validated.

Broader Impact

The research has significant implications for various applications, including surveillance, multimedia retrieval, and smart assistants, where targeted sound detection is crucial. The ability to generalize to unseen classes enhances the model's applicability in real-world scenarios, potentially leading to advancements in audio processing technologies. The paper presents a unified encoder framework for reference-guided targeted sound detection, achieving state-of-the-art performance and demonstrating robustness in real-world applications. The methodology and results contribute meaningfully to the field of audio machine learning, particularly in enhancing sound event detection capabilities.

Analysis: Full Paper • Full text: 13,918 characters

Diffusion Models for Joint Audio-Video Generation

Alejandro Paredes La Torre · arXiv

Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, pa...

Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consisting on 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity generations of audio video generation.

Institutional Affiliations

Primary: Duke University

All Institutions: Duke University

GitHub

ML Relevance Analysis (78)

This paper presents a significant advancement in the field of joint audio-video generation through the introduction of novel methodologies and high-quality datasets. The contributions are well-aligned with current challenges in multimodal generative models, making it a valuable addition to the literature.

Comprehensive Analysis

Methodology Assessment

The paper proposes a novel MM-Diffusion architecture trained from scratch on newly released datasets, which is a significant methodological contribution. The sequential two-step text-to-audio-video generation pipeline is particularly innovative, as it addresses the challenge of synchronizing audio and video outputs effectively. The use of pretrained encoder-decoder models for joint latent diffusion adds depth to the methodology, although the paper could benefit from a more detailed explanation of the architecture and the training process.

Experimental Evaluation

The experiments are well-structured, utilizing high-quality datasets that enhance the validity of the results. The quantitative evaluation of alignment between audio and video is a strong point, although the paper could improve by including more comprehensive qualitative assessments, such as user studies or comparisons with existing state-of-the-art methods. The results demonstrate high fidelity in generated outputs, which is promising for future applications.

Reproducibility

The paper mentions the release of datasets and code, which is crucial for reproducibility. However, it lacks detailed implementation specifics, such as hyperparameter settings and training configurations, which would aid other researchers in replicating the experiments effectively.

Limitations

One limitation is the reliance on the quality of the datasets, which may not generalize well across different types of audio-video content. Additionally, the paper does not address potential biases in the datasets, nor does it explore the scalability of the proposed methods to larger or more diverse datasets. The challenges uncovered in the multimodal decoding stage could also benefit from more in-depth analysis.

Broader Impact

The potential applications of this research are significant, particularly in entertainment, gaming, and educational content generation. By improving joint audio-video generation, the work could enhance user experiences in multimedia applications. However, ethical considerations around content generation and potential misuse should be addressed in future work. This paper presents a significant advancement in the field of joint audio-video generation through the introduction of novel methodologies and high-quality datasets. The contributions are well-aligned with current challenges in multimodal generative models, making it a valuable addition to the literature.

Analysis: Full Paper • Full text: 1,986 characters

WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework

Tianyi Tan, Jiaxin Ye, Yuanming Zhang ... · arXiv

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity gen...

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper.

Institutional Affiliations

Primary: MIT

All Institutions: MIT

GitHub

ML Relevance Analysis (92)

The paper introduces WhispSynth, a novel framework for generating high-fidelity whispered speech, addressing critical data scarcity issues in whisper research and significantly advancing the state of the art in TTS systems. The comprehensive methodology, rigorous experimental evaluation, and potential for broader applications underscore its importance in the field.

Comprehensive Analysis

Methodology Assessment

The methodology presented in the paper is innovative, particularly in the integration of Differentiable Digital Signal Processing (DDSP) with Text-to-Speech (TTS) models to create a pitch-free whisper generation framework. The authors effectively address the challenges of whisper synthesis by developing a robust pipeline that combines existing datasets with their newly constructed WhispNJU dataset. The use of adversarial training and semi-supervised dual-focus training strategies enhances the model's ability to generate high-fidelity whispered speech, demonstrating a thoughtful approach to overcoming limitations in existing TTS systems.

Experimental Evaluation

The experimental evaluation is comprehensive, with a clear focus on both subjective and objective metrics to assess the quality of the synthesized whispers. The authors provide detailed comparisons with existing datasets and methods, showcasing significant improvements in naturalness and intelligibility. The use of metrics such as DNSMOS and UTMOS, along with rigorous testing across multiple languages, strengthens the validity of their findings. However, the paper could benefit from more extensive ablation studies to further clarify the contributions of each component in their proposed framework.

Reproducibility

The paper includes a link to the official implementation on GitHub, which is crucial for reproducibility. The authors provide sufficient details about their training and evaluation processes, including dataset splits and training settings. However, the paper could improve by including more specific hyperparameter settings and a clearer description of the data preprocessing steps to facilitate easier replication of their results.

Limitations

The authors acknowledge the limitations related to the influence of non-linguistic variations on model performance and the potential impact of the audio watermarking on synthesized audio quality. Additionally, the dataset's reliance on existing corpora may introduce biases that could affect the generalizability of the findings. The lack of a systematic assessment of hardware differences in real-world applications is another notable limitation.

Broader Impact

The work has significant implications for the fields of speech synthesis and audio processing, particularly in applications requiring whisper generation, such as ASMR content creation and secure communication systems. By providing an open-source resource and framework for whisper synthesis, the authors contribute to advancing research in this niche area, potentially enabling further developments in multilingual speech technologies. The paper introduces WhispSynth, a novel framework for generating high-fidelity whispered speech, addressing critical data scarcity issues in whisper research and significantly advancing the state of the art in TTS systems. The comprehensive methodology, rigorous experimental evaluation, and potential for broader applications underscore its importance in the field.

Analysis: Full Paper • Full text: 41,203 characters

Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Jingyu Lu, Yuhan Wang, Fan Zhuo ... · arXiv

The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving pros...

The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.

Institutional Affiliations

Primary: unknown

All Institutions: unknown

Demo · GitHub

ML Relevance Analysis (88)

The main contribution of this paper is the introduction of a novel reward modeling framework, SDiaReward, which significantly improves the evaluation of spoken dialogue systems by addressing modality and colloquialness gaps through a data-driven approach. This work is pivotal in advancing the state-of-the-art in dialogue systems, providing a comprehensive methodology and robust experimental validation that could influence future research and applications in the field.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel reward modeling framework, SDiaReward, which effectively addresses the modality and colloquialness gaps in spoken dialogue systems. The methodology is well-structured, utilizing a pairwise preference learning approach to train on a specifically curated dataset (SDiaReward-Dataset) that captures the nuances of natural speech. The integration of multimodal LLMs for reward prediction, along with the establishment of ESDR-Bench for benchmarking, showcases a comprehensive approach to improving dialogue evaluation metrics.

Experimental Evaluation

The experimental evaluation is robust, demonstrating the superiority of SDiaReward over existing general-purpose audio LLMs through a series of well-defined metrics, including pairwise preference accuracy across various datasets. The results indicate significant improvements in capturing paralinguistic features and conversational spontaneity, which are critical for realistic spoken dialogue systems. The use of both micro and macro averages for accuracy assessment provides a nuanced understanding of model performance across different data regimes.

Reproducibility

The paper includes detailed implementation specifics, including model architecture, training procedures, and dataset construction methods. The availability of code and data enhances reproducibility, although the actual primary institution is not specified, which could limit the ability to verify institutional affiliations and resources.

Limitations

The paper acknowledges limitations related to the dataset's focus on "in-the-wild" recordings, which may affect the model's robustness in more controlled environments. Additionally, the potential for domain-specific biases in reward scoring is noted, suggesting that further refinements are necessary for broader applicability.

Broader Impact

The work has significant implications for the development of more natural and effective spoken dialogue systems, which can enhance human-AI interactions across various applications, including customer service, education, and entertainment. By addressing critical gaps in existing models, this research paves the way for future advancements in dialogue systems that can better understand and generate human-like speech. The main contribution of this paper is the introduction of a novel reward modeling framework, SDiaReward, which significantly improves the evaluation of spoken dialogue systems by addressing modality and colloquialness gaps through a data-driven approach. This work is pivotal in advancing the state-of-the-art in dialogue systems, providing a comprehensive methodology and robust experimental validation that could influence future research and applications in the field.

Analysis: Full Paper • Full text: 46,333 characters

NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

Qinke Ni, Huan Liao, Dekun Chen ... · arXiv

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that tr...

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.

Institutional Affiliations

Primary: The Chinese University of Hong Kong

All Institutions: The Chinese University of Hong Kong

Demo

ML Relevance Analysis (83)

The paper presents NV-Bench, a pioneering benchmark for nonverbal vocalization synthesis in expressive TTS generation, establishing a standardized evaluation framework that enhances the assessment of TTS models. The comprehensive methodology, extensive experimental evaluation, and potential for broader impact underscore its significance in advancing the field of machine learning and audio synthesis.

Comprehensive Analysis

Methodology Assessment

The methodology employed in NV-Bench is robust and well-structured, focusing on the creation of a comprehensive benchmark for nonverbal vocalization synthesis in TTS systems. The authors introduce a dual-dimensional evaluation protocol that effectively separates instruction alignment and acoustic fidelity, which is a significant advancement over existing methods that often conflate these aspects. The use of a functional taxonomy to categorize NVs as communicative acts adds depth to the evaluation process, allowing for a more nuanced understanding of NVs in TTS systems. The data collection process is thorough, utilizing a multi-lingual NVASR model and a rigorous filtering pipeline that ensures high-quality, human-verified ground truth data. This attention to detail in methodology enhances the credibility of the benchmark.

Experimental Evaluation

The experimental evaluation is comprehensive, involving a diverse set of TTS models and a well-balanced dataset of 1,651 utterances across 14 NV categories. The results demonstrate a strong correlation between the proposed objective metrics and human perception, validating the effectiveness of NV-Bench as a standardized evaluation framework. The benchmarking of state-of-the-art TTS models provides valuable insights into their performance regarding controllability and acoustic fidelity, highlighting the practical utility of the proposed framework. The inclusion of both single-label and multi-label subsets allows for a thorough assessment of model capabilities under varying conditions.

Reproducibility

The paper provides sufficient details regarding the implementation of the NVASR model and the data collection process, which aids in reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers. The authors could enhance reproducibility by providing access to the models and datasets used in their experiments, as well as detailed training configurations.

Limitations

One limitation of the study is the potential bias introduced by the selection of the dataset, which is curated from online audiovisual media. While efforts are made to ensure diversity and balance, the long-tail distribution of NV events may still lead to underrepresentation of certain NV categories. Additionally, the reliance on human verification for ground truth may introduce variability based on annotator subjectivity, which could affect the consistency of the evaluation metrics.

Broader Impact

The introduction of NV-Bench has the potential to significantly impact the field of text-to-speech synthesis by providing a standardized framework for evaluating NVs, which are crucial for enhancing the expressiveness of TTS systems. This benchmark can facilitate fair comparisons among different TTS models, driving advancements in the development of more human-like speech synthesis. The implications extend beyond academic research, as improved TTS systems can enhance applications in virtual assistants, gaming, and other interactive media where expressive communication is essential. The paper presents NV-Bench, a pioneering benchmark for nonverbal vocalization synthesis in expressive TTS generation, establishing a standardized evaluation framework that enhances the assessment of TTS models. The comprehensive methodology, extensive experimental evaluation, and potential for broader impact underscore its significance in advancing the field of machine learning and audio synthesis.

Analysis: Full Paper • Full text: 17,757 characters

Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

Jaesung Bae, Xiuwen Zheng, Minje Kim ... · arXiv

Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a th...

Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors such as SpICE, and the full framework achieves an average SRCC of 0.761 across unseen test datasets.

Institutional Affiliations

Primary: University of Illinois Urbana-Champaign

All Institutions: University of Illinois Urbana-Champaign, Korea Advanced Institute of Science

ML Relevance Analysis (83)

The main contribution of this paper is the innovative three-stage framework for dysarthric speech quality assessment that leverages unlabeled data through pseudo-labeling and weakly supervised learning. This work represents a meaningful advancement in the field of speech processing, particularly in enhancing the robustness and scalability of models for assessing dysarthric speech, which is crucial for both clinical applications and the development of inclusive technologies.

Comprehensive Analysis

Methodology Assessment

The paper introduces a three-stage framework that effectively combines pseudo-labeling with weakly supervised pretraining and fine-tuning, which is a novel approach in the context of dysarthric speech quality assessment. The use of a teacher model for generating pseudo-labels is particularly innovative, as it allows the model to leverage unlabeled data effectively. The label-aware contrastive learning strategy is well-conceived, exposing the model to diverse acoustic conditions and speaker variations, which is crucial for robustness in speech tasks. However, the paper could benefit from a more detailed explanation of the contrastive learning strategy and its implementation specifics.

Experimental Evaluation

The experiments are comprehensive, covering five unseen datasets that span various languages and etiologies, which demonstrates the robustness of the proposed method. The reported performance metrics, particularly the average SRCC of 0.761, indicate a significant improvement over state-of-the-art methods. However, the paper lacks a thorough comparison with baseline methods beyond mentioning the Whisper-based baseline and SpICE, which would strengthen the claims of superiority.

Reproducibility

The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the methodology is described, the absence of a clear protocol or access to the trained models limits the ability of other researchers to replicate the results.

Limitations

One limitation is the reliance on pseudo-labeling, which can introduce noise if the teacher model is not sufficiently accurate. Additionally, while the framework shows promise across diverse datasets, the generalizability of the approach to other forms of speech disorders or languages remains untested. The paper could also discuss potential biases in the datasets used.

Broader Impact

The proposed framework has significant implications for clinical diagnostics and inclusive speech technologies, potentially improving accessibility for individuals with dysarthria. By reducing the reliance on subjective evaluations, this work could facilitate the development of scalable, objective assessment tools in speech therapy and assistive technologies. The approach could also inspire further research into unsupervised and semi-supervised learning methods in other areas of speech processing. The main contribution of this paper is the innovative three-stage framework for dysarthric speech quality assessment that leverages unlabeled data through pseudo-labeling and weakly supervised learning. This work represents a meaningful advancement in the field of speech processing, particularly in enhancing the robustness and scalability of models for assessing dysarthric speech, which is crucial for both clinical applications and the development of inclusive technologies.

Analysis: Full Paper • Full text: 1,142 characters

VorTEX: Various overlap ratio for Target speech EXtraction

Ro-hoon Oh, Jihwan Seol, Bugeun Kim · arXiv

Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX ...

Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.

Institutional Affiliations

Primary: Chung-Ang University

All Institutions: Chung-Ang University

ML Relevance Analysis (83)

The main contribution of this work is the introduction of VorTEX, a robust text-prompted TSE model that effectively addresses the challenges of varying overlap ratios in speech mixtures, alongside the creation of the PORTE dataset for evaluating TSE performance. This research significantly advances the field of audio processing by providing a novel architecture and evaluation framework that can lead to improved speech extraction in practical applications.

Comprehensive Analysis

Methodology Assessment

The paper introduces VorTEX, a novel architecture for target speech extraction (TSE) that utilizes a Decoupled Adaptive Multi-branch (DAM) Fusion block to separate extraction and regularization pathways. This approach is innovative as it addresses the limitations of existing models that primarily focus on fully overlapped mixtures, thus enhancing the robustness of TSE across various overlap ratios. The proposed methodology is well-structured, with a clear explanation of the DAM architecture and its components, including Multi-Scale Fusion, Adaptive Fusion, and Dual Projection Fusion. The introduction of the PORTE dataset is a significant contribution, providing a controlled environment for evaluating TSE models under realistic conditions.

Experimental Evaluation

The experiments conducted are thorough, comparing VorTEX against established models in the field, such as AudioSep and DGMO, as well as other text-prompted TSE models like StyleTSE and LLM-TSE. The use of multiple evaluation metrics, including SISDR, PESQ, and the newly proposed SuRE metric, allows for a comprehensive assessment of model performance. The results demonstrate VorTEX's superior extraction fidelity and robustness, particularly in high-overlap scenarios, validating the effectiveness of the proposed architecture.

Reproducibility

The paper provides sufficient details regarding the architecture, training configuration, and evaluation metrics, which would allow for reproducibility. However, the lack of a public repository or demo URL limits the ease of access for other researchers to replicate the results.

Limitations

While the study presents significant advancements, it relies on a synthetic dataset (PORTE), which may not fully capture the complexities of real-world conversational audio. Additionally, the prompts used for TSE are limited to observable attributes, suggesting that future work could explore more complex prompt structures. The paper also acknowledges the need for further research to develop metrics that comprehensively assess extraction fidelity, perceptual quality, and speaker preservation.

Broader Impact

The findings of this research have the potential to improve applications in speech recognition, assistive technologies, and audio processing systems where clear target speech extraction is crucial. By addressing the challenges of overlapping speech, VorTEX could enhance user experiences in various real-world scenarios, such as in crowded environments or during multi-speaker conversations. The main contribution of this work is the introduction of VorTEX, a robust text-prompted TSE model that effectively addresses the challenges of varying overlap ratios in speech mixtures, alongside the creation of the PORTE dataset for evaluating TSE performance. This research significantly advances the field of audio processing by providing a novel architecture and evaluation framework that can lead to improved speech extraction in practical applications.

Analysis: Full Paper • Full text: 42,743 characters

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Pengjun Fang, Yingqing He, Yazhou Xing ... · ICLR 2026

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coars...

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.

Institutional Affiliations

Primary: [Institution not explicitly stated in the text]

All Institutions: [Institution not explicitly stated in the text]

ML Relevance Analysis (82)

The main contribution of this paper is the introduction of AC-Foley, a novel audio-conditioned framework for video-to-audio generation that enables precise acoustic control through direct audio conditioning. This work significantly advances the state of the art in audio synthesis by addressing key challenges in fine-grained sound generation and multimodal integration, paving the way for innovative applications in creative sound design.

Comprehensive Analysis

Methodology Assessment

The methodology presented in AC-Foley is robust, leveraging a two-stage training framework that effectively addresses the challenges of temporal alignment and acoustic fidelity in video-to-audio synthesis. The integration of reference audio as a conditioning mechanism is a significant innovation, allowing for precise control over generated sounds and overcoming the limitations of text-based prompts. The use of multimodal transformers to unify video, audio, and text modalities is well-justified and enhances the model's performance. The paper also provides a clear explanation of the conditional flow matching objective and the audio control module, which are critical to the success of the proposed method.

Experimental Evaluation

The experimental evaluation is comprehensive, utilizing a variety of metrics to assess performance across several dimensions, including distribution matching, semantic alignment, temporal synchronization, and spectral fidelity. The results demonstrate that AC-Foley outperforms existing methods in multiple aspects, showcasing its effectiveness in generating high-quality audio that is temporally and semantically aligned with video content. The inclusion of human studies adds a valuable subjective evaluation component, further validating the model's performance.

Reproducibility

The paper provides detailed implementation details, including training strategies, dataset descriptions, and evaluation metrics, which enhance reproducibility. However, the lack of a publicly available code repository or demo URL limits the ability of other researchers to replicate the findings directly.

Limitations

The paper acknowledges limitations in handling complex auditory environments, particularly when multiple sound sources overlap or when there are extreme temporal mismatches between reference sounds and visual content. These factors may hinder the model's ability to generate optimal audio in certain scenarios.

Broader Impact

The proposed AC-Foley framework has significant implications for the fields of sound design and multimedia content creation, enabling artists and creators to achieve precise audio synthesis that aligns closely with visual elements. Its potential applications extend to film, gaming, and virtual reality, where high-quality audio generation is crucial for immersive experiences. The main contribution of this paper is the introduction of AC-Foley, a novel audio-conditioned framework for video-to-audio generation that enables precise acoustic control through direct audio conditioning. This work significantly advances the state of the art in audio synthesis by addressing key challenges in fine-grained sound generation and multimodal integration, paving the way for innovative applications in creative sound design.

Analysis: Full Paper • Full text: 33,565 characters

Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang ... · arXiv

Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially whe...

Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcript--which inherently encode a stable, time-independent semantic prior--as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.

Institutional Affiliations

Primary: unknown

All Institutions: unknown

ML Relevance Analysis (75)

The main contribution of this paper is the introduction of a robust multimodal framework for estimating emotional mimicry intensity that effectively integrates textual anchors to mitigate the impact of noisy signals. This work represents a meaningful advancement in the field of affective computing, particularly in its approach to handling real-world challenges in multimodal data processing.

Comprehensive Analysis

Methodology Assessment

The proposed TAEMI framework introduces a novel approach to emotional mimicry intensity estimation by leveraging a Text-Anchored Dual Cross-Attention mechanism. This method effectively addresses the challenges posed by noisy and missing data in multimodal inputs, which is a significant improvement over traditional symmetric fusion methods. The integration of Learnable Missing-Modality Tokens and Modality Dropout during training is particularly innovative, as it enhances the model's robustness in real-world scenarios. However, the paper could benefit from a more detailed explanation of the attention mechanism and how it specifically interacts with the different modalities.

Experimental Evaluation

The experiments conducted on the Hume-Vidmimic2 dataset are comprehensive, showcasing the framework's ability to capture fine-grained emotional variations. The reported state-of-the-art mean Pearson correlation coefficient across six emotional dimensions indicates strong performance. However, the paper lacks a thorough comparison with a wider range of baseline methods, which could provide a clearer context for the claimed improvements. Additionally, the absence of subjective evaluations or qualitative assessments of the model's outputs limits the understanding of its practical effectiveness.

Reproducibility

The paper does not provide sufficient details regarding the implementation, hyperparameter settings, or the training process, which raises concerns about reproducibility. Including a clear methodology section with code availability or supplementary materials would significantly enhance the reproducibility of the results.

Limitations

One notable limitation is the reliance on textual transcripts as anchors, which may not always be available or accurate in real-world applications. Additionally, while the model performs well under controlled conditions, its effectiveness in highly variable environments remains to be fully validated. The potential for overfitting to the training dataset is also a concern, particularly given the complexity of the model.

Broader Impact

The implications of this research are significant for affective computing and applications in human-computer interaction, where understanding emotional states is crucial. The framework could be applied in various domains, including mental health monitoring, social robotics, and interactive entertainment. However, ethical considerations regarding data privacy and the potential for misuse in surveillance or manipulation should be addressed. The main contribution of this paper is the introduction of a robust multimodal framework for estimating emotional mimicry intensity that effectively integrates textual anchors to mitigate the impact of noisy signals. This work represents a meaningful advancement in the field of affective computing, particularly in its approach to handling real-world challenges in multimodal data processing.

Analysis: Full Paper • Full text: 112 characters

Neural Network-Based Time-Frequency-Bin-Wise Linear Combination of Beamformers for Underdetermined Target Source Extraction

Changda Chen, Yichen Yang, Wei Liu ... · ICASSP 2026

Extracting a target source from underdetermined mixtures is challenging for beamforming approaches. Recently proposed time-frequency-bin-wise switching (TFS) and linear combination (TFLC) strategies mitigate this by combining multiple beamformers in each time-frequency (TF) bin a...

Extracting a target source from underdetermined mixtures is challenging for beamforming approaches. Recently proposed time-frequency-bin-wise switching (TFS) and linear combination (TFLC) strategies mitigate this by combining multiple beamformers in each time-frequency (TF) bin and choosing combination weights that minimize the output power. However, making this decision independently for each TF bin can weaken temporal-spectral coherence, causing discontinuities and consequently degrading extraction performance. In this paper, we propose a novel neural network-based time-frequency-bin-wise linear combination (NN-TFLC) framework that constructs minimum power distortionless response (MPDR) beamformers without explicit noise covariance estimation. The network encodes the mixture and beamformer outputs, and predicts temporally and spectrally coherent linear combination weights via a cross-attention mechanism. On dual-microphone mixtures with multiple interferers, NN-TFLC-MPDR consistently outperforms TFS/TFLC-MPDR and achieves competitive performance with TFS/TFLC built on the minimum variance distortionless response (MVDR) beamformers that require noise priors.

Institutional Affiliations

Primary: unknown

All Institutions: unknown

ML Relevance Analysis (75)

This paper presents a novel neural network-based framework for target source extraction from underdetermined mixtures, significantly advancing the field of audio signal processing. The methodology effectively combines traditional beamforming concepts with modern neural network techniques, yielding promising results that could enhance various audio applications.

Comprehensive Analysis

Methodology Assessment

The proposed NN-TFLC framework introduces a neural network-based approach to time-frequency-bin-wise linear combination of beamformers, addressing the limitations of traditional methods by utilizing a cross-attention mechanism to maintain temporal-spectral coherence. The methodology is well-structured, leveraging existing concepts in beamforming while innovatively applying neural networks to enhance performance in underdetermined scenarios. The use of inplace convolutional gated linear units and Bi-LSTM for temporal context modeling is particularly noteworthy, as it allows for effective feature extraction without losing time-frequency resolution.

Experimental Evaluation

The experiments are robust, employing a comprehensive dataset synthesized from clean utterances and simulating realistic acoustic environments. The paper provides a thorough comparison with baseline methods, demonstrating consistent improvements in SI-SDR and SI-SIR metrics across various scenarios. The results are well-presented, with clear tables and visualizations that effectively illustrate the advantages of the proposed method over existing techniques.

Reproducibility

While the paper outlines the methodology and experimental setup in detail, it lacks specific implementation details such as code availability or links to datasets, which could hinder reproducibility. The absence of a demo or project URL further limits the ability for others to validate the findings.

Limitations

One limitation is the reliance on dual-microphone setups, which may not generalize well to more complex array configurations. Additionally, the performance in real-world scenarios with varying noise conditions and dynamic environments remains to be evaluated. The model's scalability to larger numbers of microphones or sources also warrants further investigation.

Broader Impact

The proposed method has significant implications for real-time audio processing applications, such as telecommunications, hearing aids, and assistive listening devices, where effective source separation is crucial. Its ability to operate without explicit noise covariance estimation could simplify deployment in practical scenarios. The framework's adaptability to various input configurations also suggests potential for broader applications in multi-source audio environments. This paper presents a novel neural network-based framework for target source extraction from underdetermined mixtures, significantly advancing the field of audio signal processing. The methodology effectively combines traditional beamforming concepts with modern neural network techniques, yielding promising results that could enhance various audio applications.

Analysis: Full Paper • Full text: 22,082 characters

CodecMOS-Accent: A MOS Benchmark of Resynthesized and TTS Speech from Neural Codecs Across English Accents

Wen-Chin Huang, Nicholas Sanders, Erica Cooper · arXiv

We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, especially across non-standard speech like accented speech. Th...

We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, especially across non-standard speech like accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test was conducted to collect 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. This dataset does not only represent an up-to-date study of recent speech synthesis system performance but reveals insights including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the same accent with the speaker. This dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.

Institutional Affiliations

Primary: National Institute Of Information And Communications Technology

All Institutions: Nagoya University, National Institute Of Information And Communications Technology, University of Edinburgh

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of the CodecMOS-Accent dataset, which provides a comprehensive benchmark for evaluating neural audio codecs and TTS systems across various English accents. This work significantly advances the understanding of how these systems perform with non-standard speech, paving the way for more inclusive and effective speech synthesis technologies.

Comprehensive Analysis

Methodology Assessment

The methodology is robust, focusing on the creation of the CodecMOS-Accent dataset, which includes a large-scale subjective evaluation of neural audio codecs and TTS systems across various English accents. The authors employed a well-structured listening test design, ensuring a diverse representation of accents and a significant number of annotations, which enhances the credibility of their findings. The inclusion of both subjective and objective evaluation metrics is commendable, providing a comprehensive understanding of the performance of the evaluated systems.

Experimental Evaluation

The experiments are thorough, with a clear focus on evaluating the performance of various TTS and NAC systems using a well-defined dataset. The analysis of subjective scores against objective metrics offers valuable insights into the relationship between human perception and automated evaluations. The findings regarding the correlation between accent similarity and speaker identity are particularly noteworthy, indicating a deeper understanding of the nuances in speech synthesis.

Reproducibility

The paper lacks specific implementation details or links to the dataset and models used, which could hinder reproducibility. While the methodology is described in detail, providing access to the dataset and models would significantly enhance the ability of other researchers to replicate the study.

Limitations

One limitation is the potential bias introduced by the listener demographics, as most listeners were from the US, which may affect the generalizability of the results. Additionally, the reliance on subjective evaluations may introduce variability based on listener preferences and experiences. The authors acknowledge that the dataset will be made public, but the timeline for this release is not specified.

Broader Impact

This work has significant implications for the development of more human-centric evaluation methods in speech synthesis, particularly for accented speech. The findings could influence future research directions in TTS and NAC systems, promoting the need for diverse training data and evaluation metrics that account for cultural and linguistic variations. The dataset itself is expected to serve as a valuable resource for researchers aiming to improve the quality and naturalness of synthesized speech across different accents. The main contribution of this paper is the introduction of the CodecMOS-Accent dataset, which provides a comprehensive benchmark for evaluating neural audio codecs and TTS systems across various English accents. This work significantly advances the understanding of how these systems perform with non-standard speech, paving the way for more inclusive and effective speech synthesis technologies.

Analysis: Full Paper • Full text: 18,664 characters

Controllable Accent Normalization via Discrete Diffusion

Qibing Bai, Yuhan Du, Tom Ko ... · arXiv

Existing accent normalization methods do not typically offer control over accent strength, yet many applications-such as language learning and dubbing-require tunable accent retention. We propose DLM-AN, a controllable accent normalization system built on masked discrete diffusio...

Existing accent normalization methods do not typically offer control over accent strength, yet many applications-such as language learning and dubbing-require tunable accent retention. We propose DLM-AN, a controllable accent normalization system built on masked discrete diffusion over self-supervised speech tokens. A Common Token Predictor identifies source tokens that likely encode native pronunciation; these tokens are selectively reused to initialize the reverse diffusion process. This provides a simple yet effective mechanism for controlling accent strength: reusing more tokens preserves more of the original accent. DLM-AN further incorporates a flow-matching Duration Ratio Predictor that automatically adjusts the total duration to better match the native rhythm. Experiments on multi-accent English data show that DLM-AN achieves the lowest word error rate among all compared systems while delivering competitive accent reduction and smooth, interpretable accent strength control.

Institutional Affiliations

Primary: The Chinese University of Hong Kong

All Institutions: The Chinese University of Hong Kong, Nanjing University, School of Intelligence Science and Technology, Shenzhen Loop Area Institute, Tencent Ethereal Audio Lab

Demo

ML Relevance Analysis (83)

The paper presents DLM-AN, a controllable accent normalization system that effectively balances accent retention and content preservation through innovative methodologies. This contribution is significant for advancing the field of speech processing, particularly in applications requiring nuanced accent control.

Comprehensive Analysis

Methodology Assessment

The proposed DLM-AN system introduces a novel approach to accent normalization by leveraging masked discrete diffusion and a Common Token Predictor (CTP) to control accent strength. The methodology effectively combines self-supervised speech tokens with a flow-matching Duration Ratio Predictor, allowing for nuanced control over both accent retention and speech rhythm. The use of a bidirectional Transformer for token prediction and the iterative generation process enhances the model's ability to produce high-quality outputs while maintaining phonetic integrity. However, the reliance on a recognition-based token encoder may introduce errors that could affect performance on heavily accented inputs.

Experimental Evaluation

The experiments conducted on multi-accent English data demonstrate that DLM-AN achieves the lowest word error rate (WER) among competing systems while maintaining competitive naturalness and accent reduction. The evaluation metrics include both subjective assessments (MUSHRA tests for naturalness and accentedness) and objective measures (WER, Speaker Encoding Cosine Similarity, and phonetic posteriorgram distance), providing a comprehensive view of the system's performance. The results indicate that the proposed method effectively balances accent normalization and content preservation.

Reproducibility

The paper provides a detailed description of the experimental setup, including datasets, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository limits the ability of other researchers to replicate the results fully. The authors mention using specific models and datasets, but without access to the exact implementations, some aspects may be challenging to reproduce.

Limitations

The paper acknowledges several limitations, including the potential for recognition errors in the token encoder, which can degrade conversion quality for heavily accented inputs. Additionally, the current system relies on a K-Means tokenizer, which may not capture the full phonetic richness necessary for optimal performance. Future work could explore incorporating L2-accented data and improving the tokenizer to enhance robustness.

Broader Impact

The DLM-AN system has significant implications for applications in language learning, dubbing, and personalized text-to-speech systems, where controllable accent normalization is crucial. By enabling users to adjust accent strength, the system can facilitate better communication and understanding in multilingual contexts. The research contributes to the broader field of speech processing and accent conversion, paving the way for more sophisticated and user-friendly audio technologies. The paper presents DLM-AN, a controllable accent normalization system that effectively balances accent retention and content preservation through innovative methodologies. This contribution is significant for advancing the field of speech processing, particularly in applications requiring nuanced accent control.

Analysis: Full Paper • Full text: 39,831 characters

DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

Bingzhou Li, Tao Huang · arXiv

Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, whi...

Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.

Institutional Affiliations

Primary: Shanghai Jiao Tong University

All Institutions: Shanghai Jiao Tong University, Tongji University

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of DASH, a training-free framework for dynamic audio-driven semantic chunking that enhances token compression in omnimodal large language models. This innovative approach effectively aligns compression with the inherent semantic structure of audio-visual signals, improving both accuracy and efficiency in processing multimodal data.

Comprehensive Analysis

Methodology Assessment

The proposed DASH framework introduces a novel approach to token compression in omnimodal large language models by leveraging audio embeddings as semantic anchors for dynamic segmentation. This method addresses the limitations of existing compression techniques that rely on fixed window partitioning and attention-based pruning, which often fail to preserve the semantic structure of audio-visual signals. The use of cosine-similarity discontinuities for boundary detection and the tri-signal importance estimator for token retention are innovative contributions that enhance the model's ability to maintain critical information during aggressive compression. The training-free nature of DASH further simplifies its integration into existing systems, making it a practical solution for real-world applications.

Experimental Evaluation

The experiments conducted on multiple benchmarks (AVUT, VideoMME, and WorldSense) demonstrate the effectiveness of DASH in maintaining accuracy while achieving higher compression ratios compared to prior methods. The results indicate that DASH not only preserves critical information but also improves inference efficiency, which is crucial for deploying omnimodal models in resource-constrained environments. The comprehensive evaluation across different datasets strengthens the validity of the findings and showcases the robustness of the proposed method.

Reproducibility

The paper provides sufficient implementation details, including hyperparameters and the experimental setup, which enhances the reproducibility of the results. The authors have made the code available on GitHub, allowing other researchers to replicate their findings and further explore the DASH framework.

Limitations

One limitation of the proposed method is its reliance on audio embeddings as the primary source for boundary detection, which may not always capture the full complexity of the multimodal signals. Additionally, while the framework is training-free, its performance may vary depending on the quality of the audio input and the specific characteristics of the video content. Future work could explore the integration of more sophisticated learning-based approaches for boundary detection and token selection.

Broader Impact

The DASH framework has significant implications for the field of multimodal machine learning, particularly in applications that require efficient processing of audio-visual data, such as video understanding, content creation, and interactive systems. By improving the efficiency of omnimodal models, DASH can facilitate advancements in real-time applications, enhance user experiences, and contribute to the development of more intelligent systems capable of understanding complex multimedia content. The main contribution of this paper is the introduction of DASH, a training-free framework for dynamic audio-driven semantic chunking that enhances token compression in omnimodal large language models. This innovative approach effectively aligns compression with the inherent semantic structure of audio-visual signals, improving both accuracy and efficiency in processing multimodal data.

Analysis: Full Paper • Full text: 34,438 characters

PulmoVec: A Two-Stage Stacking Meta-Learning Architecture Built on the HeAR Foundation Model for Multi-Task Classification of Pediatric Respiratory Sounds

Izzet Turkalp Akbasli, Oguzhan Serin · arXiv

Background: Respiratory diseases are a leading cause of childhood morbidity and mortality, yet lung auscultation remains subjective and limited by inter-listener variability, particularly in pediatric populations. Existing AI approaches are further constrained by small datasets a...

Background: Respiratory diseases are a leading cause of childhood morbidity and mortality, yet lung auscultation remains subjective and limited by inter-listener variability, particularly in pediatric populations. Existing AI approaches are further constrained by small datasets and single-task designs. We developed PulmoVec, a multi-task framework built on the Health Acoustic Representations (HeAR) foundation model for classification of pediatric respiratory sounds. Methods: In this retrospective analysis of the SPRSound database, 24,808 event-level annotated segments from 1,652 pediatric patients were analyzed. Three task-specific classifiers were trained for screening, sound-pattern recognition, and disease-group prediction. Their out-of-fold probability outputs were combined with demographic metadata in a LightGBM stacking meta-model, and event-level predictions were aggregated to the patient level using ensemble voting. Results: At the event level, the screening model achieved an ROC-AUC of 0.96 (95% CI, 0.95-0.97), the sound-pattern recognition model a macro ROC-AUC of 0.96 (95% CI, 0.96-0.97), and the disease-group prediction model a macro ROC-AUC of 0.94 (95% CI, 0.93-0.94). At the patient level, disease-group classification yielded an accuracy of 0.74 (95% CI, 0.71-0.77), a weighted F1-score of 0.73, and a macro ROC-AUC of 0.91 (95% CI, 0.90-0.93). Stacking improved performance across all tasks compared with base models alone. Conclusions: PulmoVec links event-level acoustic phenotyping with patient-level clinical classification, supporting the potential of foundation-model-based digital auscultation in pediatric respiratory medicine. Multi-center external validation across devices and real-world conditions remains essential.

Institutional Affiliations

Primary: Hacettepe University Faculty of Medicine

All Institutions: Hacettepe University Faculty of Medicine

GitHub

ML Relevance Analysis (83)

The paper presents PulmoVec, a novel multi-task framework that leverages a foundation model for pediatric respiratory sound classification, demonstrating significant potential for improving clinical decision support in pediatric respiratory medicine. The integration of demographic data and the innovative stacking approach contribute to its relevance and potential impact in the field.

Comprehensive Analysis

Methodology Assessment

The methodology is robust, employing a two-stage stacking meta-learning architecture that integrates a foundation model (HeAR) with task-specific classifiers. The use of LightGBM for stacking and the incorporation of demographic metadata is innovative, enhancing the model's clinical relevance. The approach to fine-tuning and the ensemble voting strategy for patient-level predictions are well-structured, although the absence of temporal modeling in the aggregation process could be a potential area for improvement.

Experimental Evaluation

The experiments are thorough, utilizing a substantial dataset (24,808 event-level segments from 1,652 patients) and providing detailed performance metrics across multiple tasks. The reported ROC-AUC values and accuracy scores indicate strong performance, particularly at the event level. However, the transition from event-level to patient-level predictions shows a decrease in accuracy, which is a critical observation that warrants further exploration.

Reproducibility

The authors have made their code and data publicly available, which supports reproducibility. The detailed description of the methods, including data preprocessing, model training, and evaluation metrics, enhances the transparency of the research. However, the reliance on a single-center dataset may limit the generalizability of the findings.

Limitations

Key limitations include the lack of external validation, which raises concerns about the model's performance in diverse clinical settings. The study also does not address the potential impact of annotation noise and the variability of respiratory sounds across different devices and environments. Additionally, the low recall for the Normal class at the patient level highlights a significant challenge in accurately classifying normal respiratory sounds.

Broader Impact

The findings have significant implications for pediatric respiratory medicine, particularly in enhancing diagnostic accuracy through AI-assisted auscultation. The potential for digital auscultation to improve clinical decision-making and patient outcomes is substantial, but further validation across diverse populations and settings is essential for real-world applicability. The paper presents PulmoVec, a novel multi-task framework that leverages a foundation model for pediatric respiratory sound classification, demonstrating significant potential for improving clinical decision support in pediatric respiratory medicine. The integration of demographic data and the innovative stacking approach contribute to its relevance and potential impact in the field.

Analysis: Full Paper • Full text: 28,086 characters

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki ... · arXiv

Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersection...

Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect consistent disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. The bias is implicit: responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, uncovering sharper intersectional disparities.

Institutional Affiliations

Primary: &M University

All Institutions: &M University, Centre for Speech Technology Research, Department of Speech, KTH Royal Institute of Technology, Music and Hearing, University of Edinburgh

GitHub

ML Relevance Analysis (82)

This paper presents a comprehensive analysis of intersectional bias in SpeechLLMs, revealing significant disparities in response quality based on accent and gender. The innovative methodology and rigorous experimental design contribute valuable insights into the challenges of bias in AI, emphasizing the need for sensitive evaluation methods in the field.

Comprehensive Analysis

Methodology Assessment

The methodology employed in this study is robust and innovative, utilizing a large-scale dataset of 2,880 interactions to evaluate intersectional bias in SpeechLLMs. The use of voice cloning to maintain linguistic content while varying accent and gender is a significant strength, allowing for a controlled analysis of bias. The combination of pointwise ratings, pairwise comparisons, and Best-Worst Scaling (BWS) provides a comprehensive approach to measuring bias, although the reliance on automated LLM judges alongside human evaluations raises questions about the sensitivity of the automated methods.

Experimental Evaluation

The experiments are well-structured, with clear preliminary checks to ensure the validity of the SpeechLLMs' outputs. The findings reveal significant biases, particularly against Eastern European-accented female voices, which are supported by both automated and human evaluations. However, the statistical significance of some results could be further clarified, especially in the context of the varying performance across different models.

Reproducibility

The authors have made efforts to ensure reproducibility by releasing their dataset and evaluation prompts. However, the paper could benefit from more detailed descriptions of the experimental setup, including the specific configurations of the SpeechLLMs used, to facilitate replication by other researchers.

Limitations

One limitation is the potential for bias in the human evaluators themselves, which could affect the results. Additionally, the study's focus on only three SpeechLLMs may limit the generalizability of the findings to other models. The reliance on synthetic speech may also introduce artifacts that do not reflect real-world interactions.

Broader Impact

This research has significant implications for the development of SpeechLLMs and their deployment in real-world applications. By highlighting the intersectional biases present in these models, the study underscores the importance of addressing such biases to ensure equitable AI systems. The findings could inform future research and development practices, leading to more inclusive and fair AI technologies. This paper presents a comprehensive analysis of intersectional bias in SpeechLLMs, revealing significant disparities in response quality based on accent and gender. The innovative methodology and rigorous experimental design contribute valuable insights into the challenges of bias in AI, emphasizing the need for sensitive evaluation methods in the field.

Analysis: Full Paper • Full text: 18,021 characters

Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models

Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang ... · arXiv

Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We in...

Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning.

Institutional Affiliations

Primary: National Taiwan University

All Institutions: National Taiwan University

ML Relevance Analysis (78)

The paper presents a training-free model steering framework that enhances Chain-of-Thought reasoning in large audio-language models. Its innovative approach, comprehensive experimental evaluation, and potential for broader applications position it as a significant contribution to the field of machine learning and audio processing.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel approach to enhance Chain-of-Thought (CoT) reasoning in large audio-language models (LALMs) through a training-free model steering method. The methodology is well-structured, comprising two phases: extraction of steering vectors and their injection during inference. The three proposed strategies—Vanilla Steering, Speech-derived Generalized Steering (SGS), and Text-derived Generalized Steering (TGS)—are innovative in their approach to leverage existing data without requiring additional training. The use of generalized steering vectors for improving reasoning across different modalities is particularly noteworthy, showcasing a solid understanding of the limitations of current LALMs.

Experimental Evaluation

The experiments are comprehensive, involving four advanced LALMs and multiple benchmarks, which provide a robust evaluation of the proposed methods. The results demonstrate consistent improvements in accuracy over CoT prompting, with a maximum gain of 4.4%. The comparison with baselines, including self-consistency, is well-executed, highlighting the efficiency of the proposed methods. The analysis of hyperparameter sensitivity and data efficiency adds depth to the experimental evaluation, indicating a thorough investigation of the methods' robustness.

Reproducibility

While the paper provides a clear description of the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility, such as code availability or links to datasets used. Providing such resources would significantly enhance the reproducibility of the findings.

Limitations

One limitation noted is the sensitivity of Vanilla Steering to hyperparameters, which can lead to instability in predictions. Additionally, the reliance on auxiliary datasets for SGS and TGS may limit the applicability of these methods in scenarios where such data is not readily available. The paper could benefit from a more detailed discussion on the potential trade-offs between accuracy and computational efficiency.

Broader Impact

The proposed methods have significant implications for the development of more efficient and effective reasoning capabilities in LALMs, which could enhance their applicability in various real-world tasks, such as interactive auditory intelligence and spoken reasoning applications. The findings could lead to advancements in multimodal AI systems, improving their ability to understand and process complex auditory information. The paper presents a training-free model steering framework that enhances Chain-of-Thought reasoning in large audio-language models. Its innovative approach, comprehensive experimental evaluation, and potential for broader applications position it as a significant contribution to the field of machine learning and audio processing.

Analysis: Full Paper • Full text: 15,012 characters

Evaluating Compositional Structure in Audio Representations

Chuyang Chen, Bea Steers, Brian McFee ... · ICASSP 2026

We propose a benchmark for evaluating compositionality in audio representations. Audio compositionality refers to representing sound scenes in terms of constituent sources and attributes, and combining them systematically. While central to auditory perception, this property is la...

We propose a benchmark for evaluating compositionality in audio representations. Audio compositionality refers to representing sound scenes in terms of constituent sources and attributes, and combining them systematically. While central to auditory perception, this property is largely absent from current evaluation protocols. Our framework adapts ideas from vision and language to audio through two tasks: A-COAT, which tests consistency under additive transformations, and A-TRE, which probes reconstructibility from attribute-level primitives. Both tasks are supported by large synthetic datasets with controlled variation in acoustic attributes, providing the first benchmark of compositional structure in audio embeddings.

Institutional Affiliations

Primary: New York University

All Institutions: New York University

GitHub

ML Relevance Analysis (84)

The paper presents a benchmark for evaluating compositionality in audio representations, introducing two complementary tasks that systematically assess the ability of audio encoders to capture complex sound structures. The methodology and experiments are robust, providing a valuable framework for future research in audio representation learning.

Comprehensive Analysis

Methodology Assessment

The paper introduces a benchmark for evaluating compositionality in audio representations through two novel tasks, A-COAT and A-TRE. The methodology is well-structured, adapting concepts from vision and language to the audio domain, which is innovative. The tasks are defined clearly, and the use of synthetic datasets allows for controlled experimentation. The approach to measuring compositionality is systematic and reproducible, making it a valuable contribution to the field.

Experimental Evaluation

The experiments are comprehensive, benchmarking a diverse set of pretrained audio encoders and providing detailed results that highlight the differences in compositionality across models. The use of statistical tests to analyze the results adds rigor to the evaluation. The paper effectively demonstrates how different training paradigms influence the ability to capture compositional structure in audio representations.

Reproducibility

The paper provides sufficient implementation details, including the dataset generation process and evaluation protocols, which enhances reproducibility. The availability of code and datasets on GitHub further supports the reproducibility of the experiments.

Limitations

One limitation is the reliance on synthetic datasets, which may not fully capture the complexities of real-world audio compositions. Additionally, while the benchmark is a significant step forward, its effectiveness in evaluating real-world audio representations remains to be tested.

Broader Impact

The proposed benchmark has the potential to influence future research in audio representation learning by providing a standardized way to evaluate compositionality, which is crucial for tasks involving complex audio scenes. This could lead to advancements in audio understanding, reasoning, and multimodal applications. The paper presents a benchmark for evaluating compositionality in audio representations, introducing two complementary tasks that systematically assess the ability of audio encoders to capture complex sound structures. The methodology and experiments are robust, providing a valuable framework for future research in audio representation learning.

Analysis: Full Paper • Full text: 21,475 characters

Causal Tracing of Audio-Text Fusion in Large Audio Language Models

Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee · arXiv

Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehe...

Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multi-modal integration occurs within LALMs.

Institutional Affiliations

Primary: Carnegie Mellon University

All Institutions: Carnegie Mellon University, National Taiwan University

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of causal tracing to analyze audio-text fusion in LALMs, revealing critical insights into the integration mechanisms of these models. This work stands out for its innovative approach and potential to influence future research directions in multimodal machine learning.

Comprehensive Analysis

Methodology Assessment

The paper employs causal tracing to analyze the internal workings of large audio language models (LALMs), which is a novel approach in the context of audio-text fusion. The methodology includes both layer-wise and token-wise analyses, allowing for a comprehensive understanding of how acoustic features and textual context are integrated. This dual analysis is well-justified and provides a clear framework for understanding the model's behavior, making it a significant methodological contribution.

Experimental Evaluation

The experiments are well-structured, utilizing multiple LALMs (DeSTA, Qwen, and Voxtral) to validate the findings. The results indicate varying fusion strategies among models, which is a critical insight for future research. However, the paper could benefit from more extensive datasets or benchmarks to further substantiate the findings, particularly in real-world applications.

Reproducibility

The paper lacks detailed implementation specifics, which may hinder reproducibility. While the methodology is sound, the absence of code or datasets makes it challenging for other researchers to replicate the study. Providing a GitHub repository or supplementary materials would enhance reproducibility.

Limitations

One limitation is the focus on only three models, which may not represent the full spectrum of LALMs. Additionally, the analysis primarily focuses on the internal mechanisms without extensive evaluation of the models' performance on downstream tasks, which could provide a more holistic view of their utility.

Broader Impact

The findings have significant implications for the development of more effective multimodal models that can better integrate audio and textual information. This research could inform future designs of LALMs, leading to advancements in applications such as audio understanding, question answering, and other areas where audio and text converge. The main contribution of this paper is the introduction of causal tracing to analyze audio-text fusion in LALMs, revealing critical insights into the integration mechanisms of these models. This work stands out for its innovative approach and potential to influence future research directions in multimodal machine learning.

Analysis: Full Paper • Full text: 193 characters

Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations

Jiahui Wu · arXiv

Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead...

Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.

Institutional Affiliations

Primary: Northwestern University

All Institutions: Northwestern University

ML Relevance Analysis (83)

This paper makes a meaningful contribution by systematically evaluating the robustness of text-to-audio generation systems under controlled prompt variations, revealing critical insights into model performance and the need for improved stability in generative audio systems. The comprehensive methodology and rigorous experimental design underscore its significance in advancing the field of machine learning and audio generation.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel framework for evaluating semantic fragility in text-to-audio generation systems, employing a systematic approach to assess model robustness under controlled prompt perturbations. The methodology is well-structured, utilizing three distinct perturbation categories (Minimal Lexical Substitution, Intensity Shifts, and Structural Rephrasing) to evaluate the models. The use of a dataset designed to maintain semantic intent while varying linguistic structure is a strong point, as it allows for a focused analysis of model sensitivity. The combination of multiple evaluation metrics (log-Mel spectrogram distance, MFCC-based Dynamic Time Warping, and CLAP embedding similarity) provides a comprehensive view of the models' performance across different dimensions of audio generation.

Experimental Evaluation

The experiments are rigorously conducted, with clear definitions of the perturbation types and a well-defined dataset. The results indicate that larger models tend to show improved robustness, which is consistent with existing literature on model scaling. The statistical analysis, including paired-sample t-tests and effect sizes, adds credibility to the findings. However, the paper could benefit from more extensive qualitative evaluations involving multiple listeners to complement the quantitative metrics.

Reproducibility

The paper provides a detailed description of the experimental setup, including the dataset construction process and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider sharing the dataset and evaluation framework to facilitate further research in this area.

Limitations

One limitation of the study is the relatively small size of the dataset, which may not capture the full range of linguistic variations encountered in real-world applications. Additionally, while the paper highlights the importance of multi-level evaluation, it does not explore potential methods for improving robustness in audio generation systems, which could be a valuable direction for future research.

Broader Impact

The findings of this study have significant implications for the development of more reliable text-to-audio generation systems, particularly in creative industries where semantic fidelity is crucial. By identifying the fragility of current models, the research encourages the exploration of more robust architectures and evaluation frameworks, potentially leading to advancements in AI-assisted music production and interactive media. This paper makes a meaningful contribution by systematically evaluating the robustness of text-to-audio generation systems under controlled prompt variations, revealing critical insights into model performance and the need for improved stability in generative audio systems. The comprehensive methodology and rigorous experimental design underscore its significance in advancing the field of machine learning and audio generation.

Analysis: Full Paper • Full text: 32,967 characters

LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang ... · arXiv

In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimi...

In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of enhanced speech, which are converted by a sentiment analysis model into a 1-5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes improvements in speech quality. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline in PESQ, STOI, neural quality metrics, and subjective listening tests.

Institutional Affiliations

Primary: National Taiwan University

All Institutions: National Taiwan University, University of California Irvine

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of an LLM-guided reinforcement learning framework for audio-visual speech enhancement, which enhances interpretability and aligns model training with human perceptual quality assessments. This work represents a significant advancement in the field, combining innovative methodologies with rigorous experimental validation to address longstanding challenges in speech enhancement.

Comprehensive Analysis

Methodology Assessment

The proposed methodology introduces a novel reinforcement learning framework that leverages Large Language Models (LLMs) to generate interpretable reward signals for audio-visual speech enhancement. The integration of natural language descriptions as rewards represents a significant departure from traditional scalar metrics, enhancing the interpretability and alignment of the training process with human perception. The use of sentiment analysis to convert these descriptions into numerical scores for reinforcement learning is innovative and adds depth to the reward model design. However, the methodology could benefit from a more detailed exploration of the LLM's limitations and the potential for more advanced models to enhance the reward generation process.

Experimental Evaluation

The experimental evaluation is robust, utilizing the 4th COG-MHEAR AVSE Challenge dataset, which provides a comprehensive benchmark for assessing the proposed method's performance. The results demonstrate significant improvements over both a supervised baseline and a DNSMOS-based RL baseline across multiple objective and subjective metrics, including PESQ and STOI. The subjective listening tests further validate the effectiveness of the LLM-based reward model, showcasing its ability to enhance perceived speech quality. The experiments are well-structured, but additional comparisons with other state-of-the-art methods could strengthen the claims of superiority.

Reproducibility

The paper provides a clear description of the methodology, including the architecture of the AVSE model and the reinforcement learning framework. However, the lack of publicly available code or a demo URL limits reproducibility. Detailed hyperparameter settings and training procedures are mentioned, but sharing the model weights and training scripts would facilitate independent verification of the results.

Limitations

The primary limitation identified is the reliance on a specific LLM (SALMONN) for generating reward signals, which may not generalize well across different speech enhancement tasks or datasets. Additionally, the repetitive nature of the generated natural language descriptions could hinder the model's ability to capture subtle differences in speech quality. Future work should address these limitations by exploring more advanced LLMs and refining prompt engineering strategies.

Broader Impact

The proposed framework has the potential to significantly impact the field of audio-visual speech enhancement by providing a more interpretable and human-aligned approach to model training. The integration of LLMs into the reinforcement learning paradigm could pave the way for advancements in other multimodal applications, enhancing the interpretability and effectiveness of AI systems in various domains, including assistive technologies for hearing-impaired individuals. The main contribution of this paper is the introduction of an LLM-guided reinforcement learning framework for audio-visual speech enhancement, which enhances interpretability and aligns model training with human perceptual quality assessments. This work represents a significant advancement in the field, combining innovative methodologies with rigorous experimental validation to address longstanding challenges in speech enhancement.

Analysis: Full Paper • Full text: 19,460 characters

$τ$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

Soham Ray, Keshav Dhandhania, Victor Barres ... · arXiv

Full-duplex voice agents--systems that listen and speak simultaneously--are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $τ$-voice, a benchmark for evaluating voice agents ...

Full-duplex voice agents--systems that listen and speak simultaneously--are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $τ$-voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends $τ^2$-bench into a novel voice agent benchmark combining verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio--enabling direct comparison between voice and text performance. A controllable and realistic voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; by decoupling simulation from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks: while GPT-5 (reasoning) achieves 85%, voice agents reach only 31--51% under clean conditions and 26--38% under realistic conditions with noise and diverse accents--retaining only 30--45% of text capability; qualitative analysis confirms 79--90% of failures stem from agent behavior, suggesting that observed failures primarily reflect agent behavior under our evaluation setup. $τ$-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.

Institutional Affiliations

Primary: Sierra.ai

All Institutions: Sierra.ai, Princeton University

GitHub

ML Relevance Analysis (82)

The main contribution of this paper is the introduction of the $τ$-voice benchmark, which provides a comprehensive framework for evaluating full-duplex voice agents in real-world scenarios. This work is significant as it highlights the current limitations of voice agents compared to text-based models and sets the stage for future research aimed at enhancing the capabilities of voice technology in complex environments.

Comprehensive Analysis

Methodology Assessment

The methodology proposed in this paper is robust, introducing the $τ$-voice benchmark that effectively combines multi-turn conversational dynamics with grounded task completion. The authors utilize a controllable voice user simulator that enhances realism by incorporating diverse accents and audio environments, which is a significant advancement over previous benchmarks. The decoupling of simulation from wall-clock time allows for the use of advanced LLMs without real-time constraints, which is a clever approach that enhances the evaluation process. However, the paper could benefit from a more detailed explanation of the simulation parameters and how they affect the results.

Experimental Evaluation

The experimental setup is comprehensive, evaluating task completion across 278 tasks under varying conditions. The results indicate a significant performance gap between voice agents and text-based models, which is critical for understanding the current limitations of voice technology. The quantitative results are backed by qualitative analysis, which adds depth to the findings. However, the paper could improve by providing more detailed statistics on the types of tasks where voice agents struggle the most.

Reproducibility

The paper includes a link to the project repository, which is a positive aspect for reproducibility. However, the details regarding the implementation of the voice user simulator and the specific configurations used in experiments are somewhat sparse. More thorough documentation and guidelines would enhance reproducibility for future researchers looking to build upon this work.

Limitations

The primary limitation identified is the performance disparity between voice agents and text-based models, which raises questions about the current capabilities of voice technology in real-world applications. Additionally, the focus on clean and realistic conditions may not fully capture the variability encountered in everyday use, potentially skewing the results. The authors also acknowledge that a significant portion of failures stems from agent behavior, indicating a need for further research into improving agent design.

Broader Impact

The $τ$-voice benchmark has the potential to significantly influence the development of voice agents, particularly in applications requiring natural and reliable interactions. By providing a structured evaluation framework, it encourages researchers to focus on improving conversational dynamics and task completion in voice systems, which could lead to advancements in various domains, including customer service, education, and accessibility. The main contribution of this paper is the introduction of the $τ$-voice benchmark, which provides a comprehensive framework for evaluating full-duplex voice agents in real-world scenarios. This work is significant as it highlights the current limitations of voice agents compared to text-based models and sets the stage for future research aimed at enhancing the capabilities of voice technology in complex environments.

Analysis: Full Paper • Full text: 742 characters

Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion

Jiabao Ai, Minghui Zhao, Anton Ragni · arXiv

Diffusion and flow matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We ...

Diffusion and flow matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We propose a jump-diffusion framework where discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS with improved UTMOSv2 on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.

Institutional Affiliations

Primary: School of Computer Science

All Institutions: School of Computer Science

Demo

ML Relevance Analysis (82)

The paper presents a novel jump-diffusion framework that unifies discrete temporal structure modeling and continuous spectral refinement for TTS. This comprehensive analysis highlights the technical contributions, innovative methodology, and significant implications for the field of speech synthesis.

Comprehensive Analysis

Methodology Assessment

The proposed jump-diffusion framework effectively integrates discrete temporal structure modeling with continuous spectral refinement, addressing the limitations of existing two-stage and single-stage TTS models. The Upsample-Diffuse-Downsample (UDD) strategy is particularly innovative, allowing for efficient reuse of pretrained networks while maintaining performance. The methodology is well-structured, with clear definitions of the forward and reverse processes, and the use of a Location Predictor and Content Predictor enhances the model's flexibility in generating speech.

Experimental Evaluation

The experiments conducted on the LJSpeech dataset are thorough, comparing the proposed model against established baselines like Grad-TTS. The reported results, including a significant reduction in word error rate (WER) and improvements in naturalness metrics, demonstrate the effectiveness of the jump-diffusion framework. The adaptive prosody feature, particularly in out-of-distribution scenarios, showcases the model's practical applicability in real-world speech synthesis.

Reproducibility

The paper provides sufficient implementation details, including the architecture choices and training procedures, which facilitate reproducibility. However, the reliance on pretrained models and the absence of a public code repository may hinder some aspects of reproducibility for the broader research community.

Limitations

While the jump-diffusion framework shows promise, it currently focuses on temporal structure without addressing potential improvements in spectral content refinement. The model's performance in more complex scenarios, such as multi-speaker synthesis or spontaneous speech, remains to be fully evaluated. Additionally, the lack of a comprehensive comparison with more recent TTS models may limit the contextual understanding of its advantages.

Broader Impact

This research has the potential to significantly advance the field of text-to-speech synthesis by improving the naturalness and intelligibility of generated speech. The ability to adaptively insert pauses and handle varying speech rates could enhance user experiences in applications such as virtual assistants, audiobooks, and accessibility tools. The findings may also inspire further research into integrating discrete and continuous modeling approaches in other generative tasks. The paper presents a novel jump-diffusion framework that unifies discrete temporal structure modeling and continuous spectral refinement for TTS. This comprehensive analysis highlights the technical contributions, innovative methodology, and significant implications for the field of speech synthesis.

Analysis: Full Paper • Full text: 14,419 characters

Bounds on Agreement between Subjective and Objective Measurements

Jaden Pieper, Stephen D. Voran · arXiv

Objective estimators of multimedia quality are often judged by comparing estimates with subjective "truth data," most often via Pearson correlation coefficient (PCC) or mean-squared error (MSE). But subjective test results contain noise, so striving for a PCC of 1.0 or an MSE of ...

Objective estimators of multimedia quality are often judged by comparing estimates with subjective "truth data," most often via Pearson correlation coefficient (PCC) or mean-squared error (MSE). But subjective test results contain noise, so striving for a PCC of 1.0 or an MSE of 0.0 is neither realistic nor repeatable. Numerous efforts have been made to acknowledge and appropriately accommodate subjective test noise in objective-subjective comparisons, typically resulting in new analysis frameworks and figures-of-merit. We take a different approach. By making only basic assumptions, we derive bounds on PCC and MSE that can be expected for a subjective test. Consistent with intuition, these bounds are functions of subjective vote variance. When a subjective test includes vote variance information, the calculation of the bounds is easy, and in this case we say the resulting bounds are "fully data-driven." We provide two options for calculating bounds in cases where vote variance information is not available. One option is to use vote variance information from other subjective tests that do provide such information, and the second option is to use a model for subjective votes. Thus we introduce a binomial-based model for subjective votes (BinoVotes) that naturally leads to a mean opinion score (MOS) model, named BinoMOS, with multiple unique desirable properties. BinoMOS reproduces the discrete nature of MOS values and its dependence on the number of votes per file. This modeling provides vote variance information required by the PCC and MSE bounds and we compare this modeling with data from 18 subjective tests. The modeling yields PCC and MSE bounds that agree very well with those found from the data directly. These results allow one to set expectations for the PCC and MSE that might be achieved for any subjective test, even those where vote variance information is not available.

Institutional Affiliations

Primary: Institute for Telecommunication Sciences

All Institutions: Institute for Telecommunication Sciences, National Telecommunications and Information Administration

GitHub

ML Relevance Analysis (83)

This paper makes a meaningful contribution to the field of multimedia quality assessment by deriving actionable bounds for the performance of objective estimators based on subjective test results. The introduction of the BinoVotes model presents a novel perspective that enhances the understanding of the relationship between subjective and objective measurements, offering valuable insights for future research and applications in the area.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel approach to deriving bounds on the Pearson correlation coefficient (PCC) and mean-squared error (MSE) for subjective tests in multimedia quality assessment. The authors introduce the BinoVotes model, which is based on the binomial distribution, to effectively capture the discrete nature of subjective ratings. This model allows for the derivation of bounds that are grounded in the intrinsic properties of the voting process, making the methodology both intuitive and mathematically sound. The approach is distinct from previous work that often relies on more complex models or assumptions about the distribution of votes.

Experimental Evaluation

The authors validate their theoretical bounds using data from 18 subjective tests, demonstrating that their BinoVotes and BinoMOS models yield results that align well with empirical data. The experiments are comprehensive, covering a range of multimedia types and subjective testing conditions, which strengthens the credibility of their findings. However, the paper could benefit from additional experiments that explore the performance of objective estimators across diverse datasets beyond the 18 tests analyzed.

Reproducibility

The paper provides a GitHub repository for the implementation of the BinoVotes model and the associated bounds, which enhances reproducibility. However, detailed descriptions of the datasets used in the experiments, including their characteristics and how they were processed, are somewhat limited. More thorough documentation would facilitate better reproducibility of the results.

Limitations

One limitation of the study is the reliance on the BinoVotes model for cases where vote variance information is not available, which may lead to overestimation of vote variance in some scenarios. Additionally, while the bounds derived are informative, they do not account for the potential biases introduced by the subjective nature of the voting process. Future work could explore incorporating individual subject biases into the model for a more nuanced understanding of vote variance.

Broader Impact

The findings of this paper have significant implications for the development of objective quality estimators in multimedia applications. By providing realistic performance bounds, researchers can better understand the limitations of their models and set achievable goals for improvement. The methodology could be applied to various domains beyond multimedia, including any field that relies on subjective assessments, thus broadening its impact. This paper makes a meaningful contribution to the field of multimedia quality assessment by deriving actionable bounds for the performance of objective estimators based on subjective test results. The introduction of the BinoVotes model presents a novel perspective that enhances the understanding of the relationship between subjective and objective measurements, offering valuable insights for future research and applications in the area.

Analysis: Full Paper • Full text: 44,848 characters

DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training

Ridwan Arefeen, Xiaoxiao Miao, Rong Tong ... · arXiv

Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encode...

Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders with a three-stage training strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attacking performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10\% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of EER.

Institutional Affiliations

Primary: Duke Kunshan University

All Institutions: Duke Kunshan University, NVIDIA AI Technology Centre, Singapore Institute of Technology

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of DAST, a dual-stream voice anonymization attacker that utilizes a novel three-stage training strategy to enhance the robustness of voice anonymization systems against re-identification attacks. This work represents a meaningful advancement in the field of voice privacy, combining innovative methodology with rigorous experimental validation to address a critical issue in audio machine learning.

Comprehensive Analysis

Methodology Assessment

The proposed DAST architecture employs a dual-stream model that effectively combines spectral and self-supervised learning features through a three-stage training strategy. This approach is innovative as it not only leverages the strengths of both feature types but also introduces a structured training curriculum that progressively builds the model's capabilities. The staged training allows for robust generalization across different anonymization systems, which is a significant improvement over existing methods that typically focus on single-system training. The methodology is well-justified with theoretical backing and empirical validation through ablation studies.

Experimental Evaluation

The experiments are rigorously designed, utilizing the VoicePrivacy Attacker Challenge (VPAC) dataset to evaluate the model's performance. The results demonstrate that the DAST model outperforms existing state-of-the-art attackers, particularly in terms of equal error rates (EER). The systematic evaluation of each training stage provides clear insights into the contributions of the dual-stream architecture and the effectiveness of the three-stage training approach. The use of diverse datasets for training and testing further strengthens the validity of the results.

Reproducibility

The paper includes detailed descriptions of the experimental setup, including the datasets used, training configurations, and evaluation metrics. However, it lacks a direct link to code or models, which could hinder reproducibility. The authors mention plans to release pre-trained models upon acceptance, which is a positive step towards facilitating reproducibility.

Limitations

One limitation is the reliance on the specific datasets used for training and evaluation, which may not fully capture the diversity of real-world anonymization scenarios. Additionally, while the model shows strong performance on the VPAC dataset, its effectiveness on other anonymization systems or in practical applications remains to be fully assessed. The paper does not address potential ethical concerns related to the misuse of voice anonymization attacks.

Broader Impact

The research has significant implications for privacy protection in voice communication, particularly as voice technologies become more prevalent. By improving the robustness of voice anonymization systems against re-identification attacks, the work contributes to enhancing user privacy and security. The findings could influence future designs of anonymization systems and inform policy discussions around voice data privacy. The main contribution of this paper is the introduction of DAST, a dual-stream voice anonymization attacker that utilizes a novel three-stage training strategy to enhance the robustness of voice anonymization systems against re-identification attacks. This work represents a meaningful advancement in the field of voice privacy, combining innovative methodology with rigorous experimental validation to address a critical issue in audio machine learning.

Analysis: Full Paper • Full text: 19,801 characters

VoXtream2: Full-stream TTS with dynamic speaking rate control

Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze · INTERSPEECH'26

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance...

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.

Institutional Affiliations

Primary: KTH Royal Institute of Technology

All Institutions: KTH Royal Institute of Technology

Demo

ML Relevance Analysis (83)

VoXtream2 presents a significant advancement in full-stream TTS systems by introducing dynamic speaking rate control and improving synthesis quality through innovative methodologies. The combination of low latency, high intelligibility, and adaptability positions this work as a valuable contribution to the field of machine learning and audio synthesis, with potential applications in various interactive systems.

Comprehensive Analysis

Methodology Assessment

The methodology presented in VoXtream2 is innovative, particularly in its approach to dynamic speaking rate control and the integration of classifier-free guidance (CFG) for improved speech synthesis. The use of distribution matching over duration states and prompt text masking to enable textless audio prompting demonstrates a significant advancement in TTS systems, addressing key limitations of previous models. The architecture builds on the earlier VoXtream model but introduces critical enhancements that allow for real-time, low-latency speech generation while maintaining high intelligibility and voice cloning capabilities. The detailed description of the model architecture, including the use of the International Phonetic Alphabet (IPA) and the autoregressive Temporal Transformer (TT), illustrates a thoughtful design process aimed at achieving both speed and quality.

Experimental Evaluation

The experimental evaluation is robust, utilizing a variety of datasets, including the Emilia spontaneous speech dataset and the HiFiTTS-2 dataset. The results are compared against several state-of-the-art models, showcasing competitive performance in both objective and subjective metrics. The paper includes comprehensive evaluations of static and dynamic speaking rate control, with clear metrics for intelligibility, speaker similarity, and naturalness. The use of both human evaluations and objective metrics like WER, SPK-SIM, and UTMOS adds credibility to the findings. However, the reliance on specific datasets may limit the generalizability of the results.

Reproducibility

The paper provides sufficient detail regarding the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The authors mention using a specific implementation of the Llama-3.2 transformer, which is beneficial, but further details on hyperparameters and training conditions would enhance reproducibility.

Limitations

The paper acknowledges several limitations, including the influence of the acoustic prompt's speaking rate on the generated speech rate and the complexity of the data preprocessing pipeline. The model's performance may degrade under certain conditions, particularly when generating speech at extreme speaking rates. Additionally, while the dynamic speaking rate control is a significant advancement, the model still exhibits some dependency on the prompt rate, which could be a barrier to achieving fully independent control.

Broader Impact

VoXtream2 has the potential to significantly impact the development of conversational agents and real-time TTS applications, particularly in scenarios requiring low latency and high adaptability. The ability to generate speech that closely mimics human-like dynamics in speaking rate could enhance user experience in voice-driven interfaces, making interactions more natural and engaging. Furthermore, the advancements in voice cloning and intelligibility could have applications in accessibility technologies and personalized voice synthesis. VoXtream2 presents a significant advancement in full-stream TTS systems by introducing dynamic speaking rate control and improving synthesis quality through innovative methodologies. The combination of low latency, high intelligibility, and adaptability positions this work as a valuable contribution to the field of machine learning and audio synthesis, with potential applications in various interactive systems.

Analysis: Full Paper • Full text: 38,613 characters

Understanding the strengths and weaknesses of SSL models for audio deepfake model attribution

Gabriel Pîrlogeanu, Adriana Stan, Horia Cucu · ICASSP 2026

Audio deepfake model attribution aims to mitigate the misuse of synthetic speech by identifying the source model responsible for generating a given audio sample, enabling accountability and informing vendors. The task is challenging, but self-supervised learning (SSL)-derived aco...

Audio deepfake model attribution aims to mitigate the misuse of synthetic speech by identifying the source model responsible for generating a given audio sample, enabling accountability and informing vendors. The task is challenging, but self-supervised learning (SSL)-derived acoustic features have demonstrated state-of-the-art attribution capabilities, yet the underlying factors driving their success and the limits of their discriminative power remain unclear. In this paper, we systematically investigate how SSL-derived features capture architectural signatures in audio deepfakes. By controlling multiple dimensions of the audio generation process we reveal how subtle perturbations in model checkpoints, text prompts, vocoders, or speaker identity influence attribution. Our results provide new insights into the robustness, biases, and limitations of SSL-based deepfake attribution, highlighting both its strengths and vulnerabilities in realistic scenarios.

Institutional Affiliations

Primary: Technical University of Cluj-Napoca

All Institutions: Technical University of Cluj-Napoca, POLITEHNICA Bucharest

ML Relevance Analysis (81)

The main contribution of this paper is a systematic investigation into the strengths and weaknesses of SSL-derived features for audio deepfake model attribution, revealing critical insights into the robustness and biases of these models. This work is significant as it addresses a pressing societal challenge by enhancing the understanding of audio deepfake attribution, thus contributing to the broader field of audio forensics and accountability in AI-generated content.

Comprehensive Analysis

Methodology Assessment

The paper employs a systematic approach to investigate the effectiveness of self-supervised learning (SSL) features in audio deepfake model attribution. The authors meticulously control various factors such as model checkpoints, text prompts, vocoders, and speaker identity, which is a significant methodological strength. The use of multiple architectures and the kNN-based attribution system allows for a clear analysis of how SSL features perform under different conditions. However, the reliance on a single dataset (LJSpeech) may limit the generalizability of the findings.

Experimental Evaluation

The experiments are well-structured, utilizing a comprehensive evaluation protocol that includes both in-domain and out-of-domain scenarios. The results are presented with clarity, showcasing the performance of the models across various conditions. The use of macro F1-scores for evaluation is appropriate, but the paper could benefit from additional metrics to provide a more nuanced understanding of the models' performance. The confusion matrices and detailed analysis of the results enhance the robustness of the findings.

Reproducibility

The paper provides sufficient details regarding the training protocols, architectures, and evaluation methods, which supports reproducibility. The authors mention that all trained models and generated audio samples are available upon request, which is a positive aspect for researchers looking to replicate or build upon this work. However, the lack of a public repository for the code and models limits immediate accessibility.

Limitations

One notable limitation is the focus on a single dataset, which may not capture the full variability present in real-world audio deepfakes. Additionally, while the authors explore various perturbations, the study does not address the potential impact of more complex factors such as background noise or emotional tone in the audio samples. The paper also acknowledges the need for further investigation into zero-shot voice cloning and other architectures, indicating that the current findings may not be exhaustive.

Broader Impact

The implications of this research are significant, particularly in the context of combating the misuse of synthetic speech in fraud and misinformation. By improving model attribution capabilities, the work contributes to the development of accountability measures in AI-generated content. The insights gained from this study could inform future research and practical applications in audio forensics, security, and ethical AI deployment. The main contribution of this paper is a systematic investigation into the strengths and weaknesses of SSL-derived features for audio deepfake model attribution, revealing critical insights into the robustness and biases of these models. This work is significant as it addresses a pressing societal challenge by enhancing the understanding of audio deepfake attribution, thus contributing to the broader field of audio forensics and accountability in AI-generated content.

Analysis: Full Paper • Full text: 19,900 characters

Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Mengjie Zhao, Lianbo Liu, Yusuke Fujita ... · arXiv

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substant...

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.

Institutional Affiliations

Primary: SB Intuitions

All Institutions: SB Intuitions

GitHub

ML Relevance Analysis (77)

The paper presents a novel approach to adapting Japanese SpeechLLMs for speech-worthy outputs, introducing the SpokenElyza benchmark and demonstrating the effectiveness of preference-based alignment. This contribution is significant for advancing the field of audio processing and spoken dialog systems, particularly in languages with distinct spoken and written forms.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel preference-based alignment approach for adapting Japanese SpeechLLMs to produce speech-worthy outputs, addressing a significant gap in the existing literature. The methodology is well-structured, utilizing Direct Preference Optimization (DPO) combined with supervised fine-tuning (SFT) to balance the generation of conversationally natural outputs while maintaining instruction-following capabilities. The construction of the SpokenElyza benchmark is a notable strength, as it incorporates human verification to ensure the quality of the generated speech-worthy text.

Experimental Evaluation

The experiments are comprehensive, comparing the performance of the proposed method against both the original written-style outputs and the newly created SpokenElyza dataset. The results demonstrate a clear improvement in speech-worthiness while preserving performance on traditional benchmarks, indicating the effectiveness of the proposed approach. The use of LLM-as-Judge for evaluation is a sound choice, although the paper could benefit from more detailed statistical analysis of the results.

Reproducibility

The paper provides a clear description of the model architecture and training procedures, which aids in reproducibility. However, the lack of publicly available code or a demo limits the ability for other researchers to replicate the experiments fully. The paper mentions the use of in-house datasets and models, which may not be accessible to the broader community.

Limitations

One limitation is the focus on the Japanese language, which may restrict the generalizability of the findings to other languages with different spoken and written divergences. Additionally, while the paper addresses the speech-worthiness of responses, it does not explore the potential impact of cultural nuances in conversational styles across different regions in Japan.

Broader Impact

The proposed methods and the SpokenElyza benchmark have significant implications for the development of more effective Japanese spoken dialog systems, which can enhance user interactions in various applications, such as virtual assistants and customer service bots. The approach could also inspire similar methodologies in other languages, potentially leading to advancements in multilingual speech synthesis systems. The paper presents a novel approach to adapting Japanese SpeechLLMs for speech-worthy outputs, introducing the SpokenElyza benchmark and demonstrating the effectiveness of preference-based alignment. This contribution is significant for advancing the field of audio processing and spoken dialog systems, particularly in languages with distinct spoken and written forms.

Analysis: Full Paper • Full text: 18,907 characters

Audio ML Papers

🏆 Top Papers This Week

Institutional Affiliations

ML Relevance Analysis (92)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (88)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (84)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (78)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Institutional Affiliations

ML Relevance Analysis (92)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility