We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
Primary: NVIDIA
All Institutions: NVIDIA
The main contribution of this paper is the introduction of Nemotron 3 Nano Omni, an efficient multimodal model that significantly enhances audio, visual, and textual reasoning capabilities. This work represents a substantial step forward in the integration of multimodal inputs, addressing key challenges in the field while providing a solid foundation for future research and applications.
The paper presents a comprehensive multimodal model, Nemotron 3 Nano Omni, which integrates audio, text, images, and video inputs. The methodology is robust, employing a mixture-of-experts (MoE) architecture that enhances processing efficiency for long multimodal sequences. The introduction of dynamic image resolution and Conv3D-based temporal video compression represents significant advancements in handling multimodal data. The staged training approach is particularly noteworthy, as it addresses the challenges of modality alignment and catastrophic forgetting, ensuring stable cross-modal integration. The innovative multimodal token-reduction techniques further enhance inference efficiency, which is crucial for real-time applications.
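As a concrete illustration of the Conv3D-based temporal video compression mentioned above, the sketch below (a minimal PyTorch example with illustrative dimensions and stride, not the released Nemotron implementation) shows how a strided 3D convolution merges groups of adjacent frames so that fewer visual tokens reach the language backbone.

```python
import torch
import torch.nn as nn

class TemporalVideoCompressor(nn.Module):
    """Illustrative sketch: collapse every `temporal_stride` frames into one."""
    def __init__(self, dim: int = 1024, temporal_stride: int = 4):
        super().__init__()
        # Stride only along the time axis; spatial layout is untouched.
        self.conv = nn.Conv3d(dim, dim,
                              kernel_size=(temporal_stride, 1, 1),
                              stride=(temporal_stride, 1, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, height, width, dim) features from a vision encoder
        x = frames.permute(0, 4, 1, 2, 3)        # -> (B, C, T, H, W) for Conv3d
        x = self.conv(x)                         # T shrinks by temporal_stride
        return x.permute(0, 2, 3, 4, 1)          # back to (B, T', H, W, C)

video_feats = torch.randn(1, 16, 14, 14, 1024)   # 16 frames of patch features
compressed = TemporalVideoCompressor()(video_feats)
print(compressed.shape)                          # torch.Size([1, 4, 14, 14, 1024])
```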
The experimental evaluation is thorough, with extensive benchmarking across various tasks, including document understanding, audio-visual reasoning, and voice interaction. The model demonstrates leading performance on multiple leaderboards, indicating its effectiveness in real-world applications. The results are quantitatively supported by comparisons with previous models, showcasing significant improvements in accuracy and efficiency. However, the paper could benefit from more qualitative assessments of the model's performance in practical scenarios.
The authors provide model checkpoints in multiple formats (BF16, FP8, FP4) and share portions of the training data and codebase on Hugging Face and GitHub. This transparency enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work. However, the detailed training recipes and hyperparameters could be more explicitly documented to facilitate easier reproduction of the results.
While the model shows impressive results, it may still face challenges with certain edge cases in multimodal reasoning, particularly in noisy or ambiguous inputs. The reliance on large-scale training data and computational resources may limit accessibility for smaller research groups. Additionally, the paper does not address potential biases in the training data, which could affect the model's generalizability across diverse applications.
The advancements presented in this paper have significant implications for various fields, including human-computer interaction, automated content generation, and assistive technologies. The model's ability to process and reason across multiple modalities can enhance user experiences in applications such as virtual assistants, educational tools, and multimedia content analysis. The open release of the model and its components encourages further research and development in multimodal AI, fostering innovation in the field.
To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Center for Language and Speech Processing, Human Language Technology Center of Excellence (COE)
The main contribution of this paper is the introduction of DiffAnon, a diffusion-based voice anonymization framework that enables explicit and continuous control over prosody preservation, significantly advancing the field of privacy-preserving speech technologies. This work represents a meaningful step forward in balancing the utility-privacy trade-off in voice applications, showcasing the potential for structured prosody control in enhancing both privacy and expressiveness in anonymized speech.
The proposed methodology, DiffAnon, leverages a novel diffusion-based framework with classifier-free guidance to provide continuous control over prosody preservation in voice anonymization. This approach is innovative as it allows for the modulation of the utility-privacy trade-off in a structured manner, which is a significant advancement over existing methods that operate at fixed points. The integration of semantic embeddings from an RVQ codec with a diffusion model is particularly noteworthy, as it combines strengths from both domains to enhance the quality of anonymized speech.
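To make the continuous inference-time control concrete, here is a minimal sketch of standard classifier-free guidance with a prosody-conditioning weight; `denoiser`, `semantic`, and `prosody` are illustrative placeholders, not the DiffAnon API.

```python
def guided_noise_estimate(denoiser, x_t, t, semantic, prosody, w_prosody: float):
    """Interpolate between prosody-conditioned and unconditioned predictions.

    w_prosody = 0 drops prosody entirely (stronger anonymization);
    larger w_prosody preserves more prosodic detail from the source,
    giving a single model a continuum of operating points.
    """
    eps_uncond = denoiser(x_t, t, semantic, prosody=None)
    eps_cond = denoiser(x_t, t, semantic, prosody=prosody)
    return eps_uncond + w_prosody * (eps_cond - eps_uncond)
```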
The experiments are robust, utilizing the VoicePrivacy Challenge 2024 protocol, which provides a standardized framework for evaluating privacy and utility. The results demonstrate that DiffAnon achieves competitive performance across various metrics, including EER for privacy and WER for content preservation, while also showing a clear trade-off between privacy and prosodic fidelity. The systematic evaluation across different prosody guidance weights adds depth to the findings.
The authors have made their code and pretrained models publicly available, which is a strong point for reproducibility. The detailed training and inference setup, including hyperparameters and datasets used, further supports replicability of the results.
While the paper presents a significant advancement, it does not explore the potential impact of varying speaker characteristics on the performance of the model. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other languages or dialects. The paper also does not address the computational costs associated with training and deploying the model in real-world applications.
The ability to anonymize voice while preserving prosody has significant implications for privacy in various applications, including telecommunication, virtual assistants, and voice-based interactions. This work could enhance user trust in voice technologies by providing a means to protect identity while maintaining communicative effectiveness. The structured control over prosody could also lead to advancements in emotional speech synthesis and human-computer interaction.
Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and class imbalance, especially in recognizing complex chords such as non-triads, which are underrepresented in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.
Primary: Seoul National University
All Institutions: Seoul National University
This paper presents a significant advancement in automatic chord recognition through a novel segment-level sequence modeling approach, effectively addressing oversegmentation and data imbalance challenges. The methodology is well-structured, and the experimental results demonstrate substantial improvements, marking a meaningful contribution to the field of music information retrieval.
The paper introduces a novel segment-level sequence-to-sequence approach for automatic chord recognition (ACR), effectively addressing oversegmentation and data imbalance issues prevalent in traditional frame-level methods. The use of a Transformer encoder-decoder architecture is well-justified, and the introduction of two token representations (MERGE and SPLIT) demonstrates a thoughtful approach to chord modeling. The encoder pre-training method based on chord similarity is innovative and enhances the model's ability to generalize, particularly for complex chord types.
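The two token schemes can be illustrated as follows, under the assumption that each chord segment carries a root, a quality, and a duration; the paper's exact vocabularies and token semantics may differ.

```python
segments = [("C", "maj", 4), ("A", "min7", 2), ("G", "7", 2)]

# MERGE-style: one composite token per segment.
merge_tokens = [f"{root}:{quality}/{dur}" for root, quality, dur in segments]
# ['C:maj/4', 'A:min7/2', 'G:7/2']

# SPLIT-style: separate root / quality / duration tokens decoded
# auto-regressively, so rare chords share sub-tokens with common ones
# and suffer less from data imbalance.
split_tokens = [tok for root, quality, dur in segments
                for tok in (f"root_{root}", f"qual_{quality}", f"dur_{dur}")]
# ['root_C', 'qual_maj', 'dur_4', 'root_A', ...]
```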
The experiments are comprehensive, utilizing a well-defined dataset of 471 pop songs with manual annotations. The use of 5-fold cross-validation strengthens the reliability of the results. The reported improvements in both chord recognition and segmentation metrics, particularly for complex chords, are significant and demonstrate the effectiveness of the proposed methods. The ablation studies provide clear insights into the contributions of each component of the model.
The paper includes sufficient implementation details, such as data preprocessing, model architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The availability of the code repository enhances this aspect, allowing other researchers to replicate the results and build upon this work.
While the paper addresses several critical challenges in ACR, it does not discuss the potential limitations of the proposed methods, such as the reliance on the quality of the training dataset or the challenges in generalizing to genres or styles not represented in the dataset. Additionally, the model's performance on real-world recordings versus studio recordings could be explored further.
The advancements in chord recognition could have significant implications for music information retrieval, music education, and automated music composition systems. By improving the recognition of complex chords, this work could enhance tools for musicians and composers, making music analysis more accessible and efficient.
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modifying or retraining the underlying vocoder, such as DiffWave or HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions such as compression and additive noise, while preserving high perceptual audio quality.
Primary: Queen's University
All Institutions: Queen's University, University of Waterloo
MelShield presents a novel in-generation audio watermarking framework that effectively integrates into TTS systems, enhancing copyright protection and attribution mechanisms. The comprehensive evaluation and innovative methodology position this work as a significant contribution to the field of audio processing and machine learning.
The methodology presented in MelShield is innovative, leveraging a keyed spread-spectrum approach for watermarking directly in the Mel-spectrogram domain of TTS systems. This is a significant advancement over traditional post-hoc watermarking methods, as it integrates watermarking seamlessly into the audio generation pipeline without requiring modifications to existing vocoders. The use of low-energy perturbations and adaptive masking to maintain audio quality while embedding watermarks is particularly noteworthy. The authors provide a clear and systematic approach to embedding and extracting watermarks, which is well-justified and theoretically sound.
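A minimal sketch of the keyed spread-spectrum idea follows, assuming a key-seeded pseudo-random chip per payload bit spread uniformly over the whole spectrogram; MelShield's time-frequency region selection and adaptive masking are more elaborate than this.

```python
import numpy as np

def keyed_chips(key: int, n_bits: int, shape) -> np.ndarray:
    """Key-seeded +/-1 pseudo-noise carriers, one per payload bit."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=(n_bits, *shape))

def embed(mel: np.ndarray, bits, key: int, alpha: float = 0.05) -> np.ndarray:
    chips = keyed_chips(key, len(bits), mel.shape)
    signs = np.where(np.asarray(bits) == 1, 1.0, -1.0)
    # Low-energy perturbation: each bit is spread across the full T-F plane.
    return mel + alpha * np.tensordot(signs, chips, axes=1)

def extract(marked: np.ndarray, key: int, n_bits: int) -> np.ndarray:
    # Correlate against the same keyed chips; the host acts as noise, so
    # reliability grows with spectrogram size relative to alpha.
    chips = keyed_chips(key, n_bits, marked.shape)
    corr = (chips * marked).reshape(n_bits, -1).sum(axis=1)
    return (corr > 0).astype(int)

mel = np.random.randn(80, 200)            # host Mel-spectrogram (F, T)
wm = embed(mel, [1, 0, 1, 1], key=42)
print(extract(wm, key=42, n_bits=4))      # [1 0 1 1]
```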
The experimental evaluation is comprehensive, utilizing two prominent TTS vocoders (DiffWave and HiFi-GAN) and a robust dataset (LJSpeech 1.1). The results demonstrate high bit accuracy for watermark recovery under various conditions, including common signal distortions. The paper effectively compares MelShield against existing watermarking methods, showcasing its superior performance in terms of robustness and fidelity. The use of multiple evaluation metrics (PESQ, STOI, DNSMOS) adds credibility to the results, although the paper could benefit from more extensive user studies to assess perceptual quality in real-world scenarios.
The paper provides a detailed description of the experimental setup, including the datasets, vocoder configurations, and watermark embedding parameters. However, it lacks a publicly accessible code repository or demo URL, which would enhance reproducibility and allow other researchers to validate the findings. Clearer documentation of the implementation would also aid in replicating the experiments.
One limitation is the reliance on specific vocoders, which may not generalize to all TTS systems. While the authors claim model-agnostic deployment, the performance may vary with different architectures not tested in the study. Additionally, the paper does not address potential vulnerabilities to advanced adversarial attacks that could target the watermarking system. The scalability of the approach in high-demand real-world applications remains to be fully explored.
The implications of this work are significant, particularly in the context of copyright protection and attribution for AI-generated audio. As deepfake technologies become more prevalent, robust watermarking solutions like MelShield can help mitigate risks associated with misinformation and unauthorized content distribution. The framework could be applied across various domains, including media production, digital rights management, and content verification systems.
Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.
Primary: Beijing Jiaotong University
All Institutions: Beijing Jiaotong University, Malou Tech Inc, South-Central Minzu University, Fudan University, Renmin University of China
This paper presents a significant advancement in the field of music-driven motion synthesis through the introduction of a Transformer-based framework for generating conducting gestures. The methodology effectively combines detailed pose representation with a novel evaluation approach, setting a new standard for future research in this area.
The proposed methodology introduces a novel Transformer-based framework, TransConductor, which effectively addresses the challenge of generating conducting gestures from music. The use of SMPL parameters for detailed pose representation is a significant advancement over traditional sparse keypoint methods, allowing for a more nuanced and expressive depiction of conducting motions. The dual encoder-decoder architecture, comprising a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder, is well-conceived, leveraging the strengths of self-attention mechanisms to capture long-range dependencies in both music and gesture. The introduction of a retrieval-based evaluation model further enhances the methodology by providing a more meaningful assessment of the artistic correspondence between music and gestures, which is often overlooked in traditional metrics.
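The autoregressive prediction described above reduces to a loop of the following shape; a GRU cell stands in for the Trans-Temporal decoder here, and all dimensions are illustrative assumptions (72-D SMPL body pose, 128-D acoustic descriptors), not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GestureDecoderLoop(nn.Module):
    def __init__(self, music_dim=128, pose_dim=72, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(music_dim + pose_dim, hidden)  # stand-in for the
        self.to_pose = nn.Linear(hidden, pose_dim)           # Transformer decoder

    def forward(self, music_feats: torch.Tensor, init_pose: torch.Tensor):
        # music_feats: (T, music_dim); init_pose: (pose_dim,)
        h = torch.zeros(1, self.rnn.hidden_size)
        pose, poses = init_pose.unsqueeze(0), []
        for t in range(music_feats.shape[0]):
            inp = torch.cat([music_feats[t].unsqueeze(0), pose], dim=-1)
            h = self.rnn(inp, h)
            pose = self.to_pose(h)           # next SMPL pose, fed back in
            poses.append(pose)
        return torch.cat(poses)              # (T, pose_dim)

out = GestureDecoderLoop()(torch.randn(120, 128), torch.zeros(72))
print(out.shape)                             # torch.Size([120, 72])
```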
The experimental evaluation is robust, comparing the proposed model against established baselines in dance and conducting generation. The reported metrics (FID, M-Dist, MM-Dist, and diversity) indicate significant improvements in the quality and alignment of generated gestures with the corresponding music. The ablation studies convincingly demonstrate the contributions of the Transformer architecture and the alignment loss, supporting the claims of enhanced performance. The diversity in the dataset, covering various conducting styles and musical emotions, strengthens the validity of the results and showcases the model's adaptability.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or dataset access, which are crucial for reproducibility. The absence of a demo or project URL further limits the ability of other researchers to validate and build upon this work.
The paper acknowledges certain limitations, including the reliance on monocular reconstruction, which may not capture all nuances of conducting gestures, particularly baton motion and finger articulation. Additionally, the model struggles with very large gestures in energetic music and may lag during fast transitions. These limitations suggest areas for future research, such as incorporating hand-aware reconstruction techniques and exploring longer musical contexts.
The implications of this work extend beyond academic interest; it has potential applications in music education, virtual performances, and intelligent tutoring systems. By automating the generation of conducting gestures, this research could enhance interactive music learning environments and provide valuable tools for musicians and educators. The framework could also inspire further exploration of cross-modal motion synthesis in other artistic domains, promoting a deeper understanding of the interplay between music and movement.
Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.
Primary: South China University of Technology
All Institutions: South China University of Technology
MindMelody presents a novel approach to EEG-driven personalized music intervention, demonstrating a sophisticated integration of machine learning techniques that enhance the adaptability and effectiveness of music therapy. The paper's contributions to the field of affective computing and music generation are substantial, offering a promising direction for future research and applications in mental health.
The methodology presented in MindMelody is innovative, integrating a hybrid Transformer-GNN architecture for EEG decoding with a Retrieval-Augmented Generation (RAG) mechanism to formulate structured intervention plans. The use of a Hierarchical EEG Controller to modulate a pretrained music generation backbone is particularly noteworthy, as it allows for fine-grained control over the music output based on real-time EEG data. The closed-loop feedback mechanism that continuously adapts to user feedback enhances the system's responsiveness and personalization, which is a significant advancement over static music generation systems.
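The closed-loop behavior can be sketched as below; `decode_affect` and `generate_music` are hypothetical stand-ins for the Transformer-GNN decoder and the controlled music backbone, and the proportional parameter update is an assumption for illustration only.

```python
import itertools
import random

def closed_loop_session(eeg_stream, decode_affect, generate_music, n_cycles=10):
    """Each cycle: read EEG, decode valence-arousal, steer generation params."""
    params = {"tempo": 90, "valence_target": 0.0}
    for _ in range(n_cycles):
        window = next(eeg_stream)                 # latest EEG window
        valence, arousal = decode_affect(window)  # global V-A state
        # Illustrative proportional update: arousal drives tempo,
        # valence drives the tonal target of the next clip.
        params["tempo"] = int(60 + 60 * max(0.0, min(1.0, arousal)))
        params["valence_target"] = valence
        yield generate_music(params)

fake_eeg = iter(lambda: [random.random() for _ in range(32)], None)
decode = lambda w: (sum(w) / len(w) - 0.5, random.random())
make_music = lambda p: f"clip@{p['tempo']}bpm"
for clip in itertools.islice(closed_loop_session(fake_eeg, decode, make_music), 3):
    print(clip)
```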
The experiments conducted are robust, utilizing established datasets like DEAP for EEG affect modeling and MusicCaps for controllable music generation. The paper provides comprehensive quantitative metrics, including FAD and various subjective evaluations (Nat.-MOS, Emo.-MOS, Help.), which demonstrate the system's effectiveness in emotional alignment and perceived helpfulness. The pilot user study adds valuable qualitative insights into user experience, although it is limited in scope.
The paper includes detailed descriptions of the experimental setup, including hyperparameters and training procedures, which aids in reproducibility. However, the lack of publicly available code or a demo limits the ability for others to replicate the findings fully.
One limitation is the reliance on a relatively small dataset for training, which may affect the generalizability of the model across diverse populations. Additionally, while the pilot study shows promising results, it is not a clinical validation, and further research is needed to establish long-term efficacy and safety in real-world applications.
The potential applications of MindMelody are significant, particularly in mental health interventions, where personalized music therapy could provide non-invasive and cost-effective support for individuals experiencing emotional distress. The integration of EEG data with music generation could pave the way for more adaptive therapeutic tools in the field of affective computing.
Speech technologies are deployed in high-stakes settings, yet fairness concerns remain fragmented across tasks and disciplines. Existing surveys either adopt a general machine-learning perspective that overlooks speech-specific properties or focus on a single task, missing failure patterns shared across the speech domain. Synthesizing over 400 studies spanning generation and perception tasks and emerging speech-language models, this survey presents a unified framework that links formal fairness definitions to evaluation, diagnosis, and mitigation. We formalize seven fairness definitions adapted to the speech modality and organize the field's conceptual evolution through three paradigms: Robustness, Representation, and Governance. We then ground evaluation metrics in the mathematical cores of these definitions and offer a decision tree for metric selection. We diagnose bias sources along the speech processing pipeline, surfacing speech-specific mechanisms such as channel bias as a demographic proxy and annotation subjectivity in emotion labels. We systematize mitigation strategies across four intervention stages, mapping each to the diagnosed sources. Finally, we identify open challenges and propose directions for future research.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California, NTU Artificial Intelligence Center of Research Excellence
This paper serves as a foundational survey that systematically addresses bias and fairness in speech AI, providing a comprehensive framework that can guide future research and development in this critical area. The authors' approach to synthesizing existing literature and formalizing fairness definitions is a significant contribution to the field, setting the stage for more equitable speech technologies.
The paper presents a comprehensive survey that synthesizes over 400 studies related to bias and fairness in speech AI, establishing a unified framework that links formal fairness definitions to evaluation metrics, bias diagnosis, and mitigation strategies. The authors formalize seven fairness definitions specifically adapted to the speech modality and provide a decision tree for metric selection, which is a novel contribution to the field. The methodology is robust, drawing on a wide range of literature and systematically addressing the unique challenges posed by the speech domain.
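As one concrete instantiation of the performance-parity family of definitions the survey formalizes, the sketch below computes the worst-case word-error-rate gap across demographic groups; the field names and aggregation are illustrative, not the survey's notation.

```python
def wer_parity_gap(results: list) -> float:
    """results: [{'group': str, 'errors': int, 'ref_words': int}, ...]

    Returns max group WER minus min group WER; 0.0 means parity.
    """
    totals = {}
    for r in results:
        t = totals.setdefault(r["group"], [0, 0])
        t[0] += r["errors"]
        t[1] += r["ref_words"]
    wers = {g: e / n for g, (e, n) in totals.items() if n > 0}
    return max(wers.values()) - min(wers.values())

gap = wer_parity_gap([
    {"group": "A", "errors": 52, "ref_words": 1000},
    {"group": "B", "errors": 87, "ref_words": 1000},
])
print(f"max-min WER gap: {gap:.3f}")   # 0.035
```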
While the paper is primarily a survey and does not include original experimental results, it effectively reviews existing literature and identifies gaps in current methodologies. It categorizes bias sources along the speech processing pipeline and systematizes mitigation strategies, which could serve as a foundation for future empirical studies. The depth of analysis into bias mechanisms and fairness paradigms is commendable, although the lack of original experimental validation limits the immediate applicability of the findings.
The survey does not present original experiments, thus reproducibility in the traditional sense does not apply. However, the clear organization of existing literature and the proposed frameworks allow for future researchers to build upon this work in a reproducible manner. The decision tree for metric selection is particularly useful for guiding future empirical studies.
One limitation of the paper is its reliance on existing literature without presenting new empirical data or case studies to validate the proposed frameworks. Additionally, while the survey covers a wide range of topics, it may not address all nuances of bias and fairness in speech technologies, particularly in emerging areas of research. The authors also acknowledge the complexity of navigating fairness in sociotechnical contexts, which may not be fully captured in their framework.
The implications of this work are significant, as it addresses critical issues of bias and fairness in speech technologies that are increasingly deployed in high-stakes environments. By highlighting the need for fairness as a core requirement rather than an afterthought, the paper encourages researchers and practitioners to consider the ethical implications of their technologies. This survey could influence future research directions and policy-making in the field of AI and speech technology.
Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.
Primary: Texas State University
All Institutions: Texas State University
The paper presents MCM-AVQA, a confidence-aware audio-visual quality assessment framework that improves the robustness of quality evaluation under asymmetric distortions. This work significantly advances the state of the art in AVQA by integrating modality-specific confidence into the fusion process, leading to more accurate and interpretable quality assessments.
The proposed MCM-AVQA framework introduces a novel approach to audio-visual quality assessment by explicitly modeling modality-specific confidence and integrating it into a dedicated Audio-Visual Mixer. This methodology allows for dynamic feature gating based on confidence levels, which is a significant advancement over traditional methods that treat audio and video as equally reliable. The use of a multi-head visual confidence estimator and an audio confidence module enhances the robustness of the model under asymmetric distortions, which is a common scenario in real-world applications. The architecture is well-structured, leveraging state-of-the-art transformer models and attention mechanisms, making it a strong contribution to the field.
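A minimal sketch of confidence-guided gating follows, reduced from the paper's frame-level, confidence-guided channel attention to a single clip-level gate per modality for clarity; module sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, video, conf_a, conf_v):
        # audio/video: (B, dim) clip features; conf_a/conf_v: (B, 1) in [0, 1]
        w = torch.softmax(torch.cat([conf_a, conf_v], dim=-1), dim=-1)  # (B, 2)
        gated_a = w[:, :1] * audio   # high-confidence stream dominates,
        gated_v = w[:, 1:] * video   # the unreliable one is suppressed
        return self.proj(torch.cat([gated_a, gated_v], dim=-1))

fuse = ConfidenceGatedFusion()
out = fuse(torch.randn(2, 256), torch.randn(2, 256),
           torch.tensor([[0.9], [0.1]]), torch.tensor([[0.2], [0.8]]))
print(out.shape)   # torch.Size([2, 256])
```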
The experiments conducted across multiple AVQA benchmarks (LIVE-SJTU, UnB-AV, UnB-AVQ) demonstrate the effectiveness of MCM-AVQA in improving correlation with human mean opinion scores. The results indicate that the model outperforms existing state-of-the-art methods, particularly in scenarios with asymmetric distortions. The ablation studies provide valuable insights into the contributions of each component of the model, reinforcing the importance of confidence-aware fusion. The use of statistical tests to validate performance improvements adds rigor to the evaluation.
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the results directly. Including a project URL or demo would significantly enhance reproducibility.
One limitation of the study is the lack of a comprehensive comparison with more recent AVQA methods that may not have been included in the evaluation. Additionally, while the model shows robustness under asymmetric distortions, its performance in extreme distortion scenarios or with novel types of distortions remains untested. The reliance on subjective mean opinion scores for evaluation, while standard, could also introduce variability based on human judgment.
The MCM-AVQA framework has significant implications for real-world applications in streaming, teleconferencing, and immersive media, where audio-visual quality is critical. By improving the accuracy of quality assessments in asymmetric distortion scenarios, this work can enhance user experiences in various multimedia applications. The approach could also be extended to other multimodal quality assessment tasks, potentially influencing future research directions in the field.
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling incur high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to 8.5x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
Primary: Amazon AGI
All Institutions: Amazon AGI, National Taiwan University
The paper presents a significant advancement in efficient generative media by introducing a one-step sampling framework that achieves substantially faster inference while maintaining high audio fidelity and semantic relevance. The innovative combination of energy-distance training and representation distillation represents a meaningful contribution to the field of machine learning, particularly in audio generation.
The proposed methodology introduces a novel one-step sampling framework for text-to-audio generation that integrates an energy-distance training objective with representation-level distillation. This approach effectively reduces inference latency while maintaining audio quality, addressing a significant limitation in existing autoregressive models that rely on multi-step sampling. The use of energy-scoring to map Gaussian noise directly to audio latents is innovative and demonstrates a clear departure from traditional diffusion-based methods. The incorporation of distillation from a masked autoregressive model further enhances the model's performance, showcasing a thoughtful combination of techniques to achieve rapid and high-quality audio synthesis.
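For reference, the generic energy-distance statistic between generated and reference latent batches can be written as below; the paper's conditional training objective may differ in detail from this unconditional form.

```python
import torch

def energy_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x: generated latents (n, d); y: reference latents (m, d).

    D^2(X, Y) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, which is zero
    iff the two distributions coincide. The zero diagonals of the
    within-batch terms slightly bias this simple estimator.
    """
    d_xy = torch.cdist(x, y).mean()   # E||X - Y||
    d_xx = torch.cdist(x, x).mean()   # E||X - X'||
    d_yy = torch.cdist(y, y).mean()   # E||Y - Y'||
    return 2 * d_xy - d_xx - d_yy

gen = torch.randn(8, 64, requires_grad=True)
ref = torch.randn(8, 64)
loss = energy_distance(gen, ref)
loss.backward()                       # gradients flow to the generator
```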
The experimental evaluation is comprehensive, utilizing the AudioCaps benchmark for both objective and subjective assessments. The paper reports consistent improvements over existing one-step baselines, with significant gains in fidelity and semantic relevance as measured by various metrics (FD, FAD, KL, IS, CLAP). The results demonstrate not only superior performance compared to prior models but also a substantial reduction in inference time, achieving up to 8.5 times faster batch inference than the state-of-the-art AR diffusion system, IMPACT. The thoroughness of the experiments, including ablation studies on representation distillation and classifier-free guidance, adds credibility to the findings.
The paper provides detailed descriptions of the experimental setup, including datasets, model configurations, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability of other researchers to replicate the results directly. Clear documentation of hyperparameters and training procedures is essential for future work in this area.
While the proposed method shows promising results, it still falls short of the audio quality achieved by multi-step diffusion models, indicating that there may be inherent trade-offs between speed and fidelity. The reliance on a single sampling step may also limit the model's flexibility in generating more complex audio sequences. Additionally, the paper does not address potential biases in the training datasets, which could affect the generalizability of the model.
The advancements in low-latency text-to-audio generation have significant implications for real-time applications in multimedia content creation, interactive media, and personalized audio experiences. The ability to generate high-quality audio quickly opens up new avenues for user engagement and creative expression. Furthermore, the integration of energy-distance training and representation distillation could inspire future research in other generative tasks across different modalities.
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuChoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
Primary: Fudan University
All Institutions: Fudan University, ByteDance
The main contribution of this paper is the introduction of GaMMA, a large multimodal model that effectively integrates temporal and non-temporal music understanding, alongside the establishment of MusicBench as a comprehensive evaluation benchmark. This work represents a significant advancement in the field of music AI, addressing critical gaps in existing models and providing a robust framework for future research.
The methodology presented in GaMMA is robust, utilizing a dual-encoder architecture that effectively captures both temporal and non-temporal aspects of music understanding. The mixture-of-experts approach, combined with a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning), is innovative and addresses existing gaps in music LMMs. The introduction of MusicBench as a comprehensive benchmark for evaluating music understanding adds significant value to the methodology, allowing for a nuanced assessment of model capabilities.
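A minimal sketch of mixing audio-encoder experts with learned gates follows; the specific encoders, expert count, and dimensions are assumptions, not GaMMA's exact configuration.

```python
import torch
import torch.nn as nn

class AudioExpertMixer(nn.Module):
    """Combine features from multiple frozen audio encoders via soft gates."""
    def __init__(self, n_experts: int = 2, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, expert_feats: list) -> torch.Tensor:
        # expert_feats: n_experts tensors of shape (B, T, dim), e.g. one
        # time-series-oriented encoder and one tag/semantics-oriented one.
        stacked = torch.stack(expert_feats, dim=-2)                    # (B, T, E, dim)
        weights = torch.softmax(self.gate(stacked.mean(-2)), dim=-1)   # (B, T, E)
        return (weights.unsqueeze(-1) * stacked).sum(-2)               # (B, T, dim)

mixer = AudioExpertMixer()
fused = mixer([torch.randn(1, 50, 768), torch.randn(1, 50, 768)])
print(fused.shape)   # torch.Size([1, 50, 768])
```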
The experiments conducted demonstrate the effectiveness of GaMMA, achieving state-of-the-art results on multiple benchmarks, including MusicBench and MuChoMusic. The extensive evaluation across various dimensions of music understanding, including temporal reasoning and global attributes, showcases the model's capabilities. The use of human-curated questions in MusicBench enhances the credibility of the results, though the paper could benefit from more extensive comparisons with a wider range of existing models.
The paper provides detailed implementation specifics, including training strategies, hyperparameters, and data curation processes, which are essential for reproducibility. However, the absence of publicly available code or datasets limits the ability for independent verification of results.
One limitation is the reliance on curated datasets, which may introduce biases or limit the generalizability of the model. Additionally, while the dual-encoder approach is innovative, it may require significant computational resources, which could hinder accessibility for broader research applications.
GaMMA has the potential to significantly impact the field of music understanding and multimodal AI by providing a framework that can be adapted for various applications, such as music recommendation systems, educational tools, and interactive music assistants. Its ability to understand and reason about music in a nuanced manner could lead to advancements in how machines interact with human creativity and cultural expressions.
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Δ = 0.013 Western, Δ = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows that the GRL objective improves either backbone, but the choice of WavLM contributes as well. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents LASE, a novel approach to cross-script identity preservation in multilingual voice cloning, demonstrating significant advancements in disentangling language from speaker identity and providing valuable resources for future research. The methodology and results contribute meaningfully to the field of audio processing and speaker recognition, particularly in the context of Indic languages.
The paper introduces a novel approach using a Language-Adversarial Speaker Encoder (LASE) that effectively disentangles language from speaker identity in multilingual voice cloning tasks. The methodology employs a gradient-reversal layer and a supervised contrastive loss to create a speaker embedding that is invariant to language, which is a significant advancement in the field. The architecture is well-defined, consisting of a frozen WavLM-base-plus backbone and a trainable projection head, which allows for efficient training and effective performance on cross-script tasks.
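The gradient-reversal construction is the standard DANN-style layer and can be sketched as follows: the forward pass is the identity, while the backward pass negates (and scales) the gradient, so the language classifier trains normally while the embedding is pushed to be language-uninformative. The 4-language head follows the abstract; embedding sizes are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)          # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # reversed, scaled gradient

embed = torch.randn(4, 256, requires_grad=True)   # speaker embeddings
lang_clf = nn.Linear(256, 4)                      # 4-language classifier
logits = lang_clf(GradReverse.apply(embed, 1.0))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 2, 3]))
loss.backward()   # embed.grad now pushes AWAY from language separability
```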
The experiments are robust, utilizing two distinct corpora to evaluate the performance of LASE against established baselines (WavLM-base-plus-sv and ECAPA-TDNN). The results demonstrate a significant reduction in the identity gap across scripts, with LASE achieving a gap of 0.013 compared to 0.082 and 0.105 for the baselines. The paper also includes a thorough analysis of the training dynamics and presents a synthetic multi-speaker diarisation benchmark, showing that LASE can match ECAPA-TDNN's performance with significantly less training data.
The authors provide a comprehensive set of resources, including the model weights, training corpus, and evaluation scripts, which enhances reproducibility. The detailed description of the training process, loss functions, and hyperparameters further supports the ability of other researchers to replicate the results.
The study relies solely on synthetic data generated by ElevenLabs, which may not fully capture the complexities of natural human speech. Additionally, the held-out set shares voices with the training data, limiting the generalization assessment. The paper also acknowledges that the model's performance on real-world data and new voices remains to be evaluated.
The implications of this work are significant for applications in multilingual voice cloning, speaker verification, and diarisation systems, particularly in contexts involving Indian languages. The ability to maintain speaker identity across different scripts can enhance user experience in customer support, content creation, and accessibility technologies.
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach is to apply standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware framework that builds on a foundational audio generation backbone to jointly generate audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video-to-audio synthesis.
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The paper presents MMAudio-LABEL, a novel framework for joint audio generation and event labeling from silent videos, demonstrating significant improvements over existing methods. The technical contributions and methodology are well-articulated, showcasing the potential for broader applications in multimedia content creation and multimodal learning.
The proposed MMAudio-LABEL framework innovatively combines audio generation with event labeling in a unified architecture, addressing the limitations of traditional post-hoc sound event detection methods. By leveraging a multimodal transformer and exploring two distinct architectures (Parallel Heads and Joint Heads), the authors demonstrate a thoughtful approach to integrating visual and auditory information. The methodology is well-structured, with clear explanations of the model architecture and training objectives, although further details on the training data preprocessing and augmentation strategies could enhance clarity.
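The Parallel Heads variant can be sketched as separate prediction heads over a shared multimodal trunk; the trunk itself and all dimensions are illustrative assumptions, with only the 17-class material set taken from the abstract.

```python
import torch
import torch.nn as nn

class ParallelHeads(nn.Module):
    """One trunk, three frame-aligned outputs: audio latents, onsets, materials."""
    def __init__(self, dim=512, latent_dim=64, n_materials=17):
        super().__init__()
        self.audio_head = nn.Linear(dim, latent_dim)       # audio latent per frame
        self.onset_head = nn.Linear(dim, 1)                # onset probability per frame
        self.material_head = nn.Linear(dim, n_materials)   # material logits per frame

    def forward(self, trunk_feats: torch.Tensor):
        # trunk_feats: (B, T, dim) from the shared multimodal transformer
        return (self.audio_head(trunk_feats),
                torch.sigmoid(self.onset_head(trunk_feats)),
                self.material_head(trunk_feats))

latents, onsets, materials = ParallelHeads()(torch.randn(1, 100, 512))
print(latents.shape, onsets.shape, materials.shape)
```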
The experiments are robust, utilizing the Greatest Hits dataset to evaluate both onset detection and material classification. The reported improvements in accuracy metrics (from 46.7% to 75.0% for onset detection and from 40.6% to 61.0% for material classification) provide compelling evidence of the framework's effectiveness. However, the paper could benefit from additional comparative analyses against a wider range of baseline models to contextualize the performance gains further.
The implementation details are adequately described, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the trained models or code would significantly enhance the paper's impact and usability for the research community.
One notable limitation is the reliance on a specific dataset (Greatest Hits), which may not fully represent the diversity of audio events in real-world scenarios. Additionally, the model's performance on less distinctive materials indicates potential challenges in generalization. The paper could also discuss the computational complexity and resource requirements of the proposed framework.
The MMAudio-LABEL framework has significant implications for content creation, immersive media, and human-computer interaction, as it enables more intuitive sound event labeling from silent videos. This could streamline workflows in various industries, including film production and gaming, where accurate audio representation is crucial. The integration of audio generation and event labeling also opens avenues for future research in multimodal learning and generative models.
We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and the high annotation costs arising from required domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs spanning multiple-choice, sequential multi-turn, and open-ended formats, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model like Gemini-2.5-Pro achieves only approximately 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.
Primary: University of Maryland, College Park, MD, USA
All Institutions: Centific Global Solutions Inc., University of Maryland, College Park, MD, USA
The paper presents MedMosaic, a large-scale medical audio question-answering benchmark designed to evaluate audio reasoning models under realistic clinical constraints. This work is significant as it addresses a critical gap in the evaluation of multimodal reasoning in the medical domain, providing a structured framework for future research and development in audio understanding and reasoning.
The methodology presented in this paper is robust, featuring a comprehensive pipeline for generating question-answer pairs from diverse medical audio sources. The authors effectively address the challenges of collecting and annotating medical audio data by leveraging synthetic audio generation techniques. The structured approach to creating varied question types (e.g., sound-only, speech-only, multi-turn) is commendable, as it allows for a nuanced evaluation of audio reasoning capabilities. The use of subject matter experts for validation adds credibility to the dataset's clinical relevance. However, the reliance on synthetic data raises questions about the authenticity of the generated audio and its implications for real-world applications.
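For concreteness, the sketch below shows one plausible record layout for such a question-answer item, covering the question categories described above; all field names are illustrative assumptions rather than the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MedAudioQA:
    """Illustrative record layout; field names are assumptions,
    not the released dataset schema."""
    audio_path: str
    audio_type: str     # e.g. "physiological", "synthetic_voice", "conversation"
    category: str       # "multiple_choice" | "multi_turn" | "open_ended"
    question: str
    answer: str
    choices: Optional[List[str]] = None   # populated for multiple-choice items
    turn_index: int = 0                   # position within a multi-turn chain
```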
The experimental evaluation is thorough, benchmarking 13 different audio and multimodal reasoning models against the MedMosaic dataset. The results demonstrate significant performance challenges across all models, highlighting the dataset's difficulty and the need for further advancements in medical audio reasoning. The detailed breakdown of model performance across various question types provides valuable insights into the strengths and weaknesses of current systems. However, the paper could benefit from more extensive comparisons with existing benchmarks to contextualize the results further.
The paper provides a detailed description of the dataset generation process and the evaluation framework, which aids in reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the findings. The authors should consider releasing the dataset and the generation pipeline to enhance reproducibility and facilitate further research in this area.
The primary limitation of this work lies in the reliance on synthetic audio, which may not fully capture the complexities of real-world medical audio scenarios. Additionally, while the dataset is extensive, the performance of state-of-the-art models remains relatively low, indicating that the benchmark may still be too challenging for current systems. The authors acknowledge the need for further validation before clinical deployment, which is a critical consideration for any application in healthcare.
The development of MedMosaic has the potential to significantly advance the field of medical audio processing and reasoning. By providing a challenging benchmark, it encourages the development of more sophisticated models capable of understanding and reasoning over complex medical audio. This could ultimately lead to improved clinical decision-making and patient outcomes. However, the authors emphasize the importance of extensive validation before any real-world application, highlighting the need for caution in deploying AI systems in healthcare settings. The paper presents MedMosaic, a large-scale medical audio question-answering benchmark designed to evaluate audio reasoning models under realistic clinical constraints. This work is significant as it addresses a critical gap in the evaluation of multimodal reasoning in the medical domain, providing a structured framework for future research and development in audio understanding and reasoning.
Current federated multimodal continual learning over mixture-of-experts low-rank adaptation (MoE-LoRA) is built on the unverified assumption that routing isolates task-specific knowledge into disjoint experts. We argue that routing operates per sample, whereas forgetting accumulates across the task sequence, and that gradient conflict persists within each expert even when routing is maximally polarized. Moreover, activation-subspace protection can also fail: under parameter-efficient fine-tuning it entangles tasks due to a dimension-counting bound, and federated averaging (FedAvg) disrupts client-side orthogonality. To address this, we propose PRISM (Per-expert Routing-projection Interference-informed Subspace Method), which maintains a per-expert gradient subspace basis whose orthogonality is preserved under FedAvg and reinterprets MoE routing as a capacity allocator. Our results show that, on LLaVA-1.5-7B, LLaVA-1.5-13B, and Qwen2.5-VL-7B across CoIN-6 and CoIN-Long-10, PRISM outperforms sixteen state-of-the-art baselines in average accuracy. Compared to the best federated multimodal baseline, the performance margin increases from +3.23 pp on CoIN-6 to +6.06 pp on CoIN-Long-10.
Primary: South Dakota State University
All Institutions: South Dakota State University
The main contribution of this paper is the introduction of PRISM, a novel approach that effectively resolves issues of spurious isolation in federated multimodal continual learning by maintaining orthogonality in gradient subspaces and reinterpreting routing mechanisms. The comprehensive analysis of the methodology, experimental results, and potential applications underscores its significance in advancing the field of federated learning.
The paper introduces PRISM, a novel method addressing the limitations of existing federated multimodal continual learning approaches. The methodology is well-structured, focusing on the preservation of orthogonality in gradient subspaces and reinterpreting MoE routing as a capacity allocator. The proposed mechanisms, including the Per-Expert Federated Orthogonal Subspace Union (PE-FOSU) and interference-informed scheduling, are innovative and effectively tackle the identified issues of spurious isolation and entangled activation subspaces. The authors provide a clear theoretical foundation for their approach, which is crucial for understanding the underlying principles of their method.
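The core subspace mechanics can be made concrete with a short sketch. The code below is a generic gradient-projection routine in the spirit of the per-expert basis described above, assuming gradients are flattened vectors and the stored basis has orthonormal columns; function names are illustrative, and the actual PE-FOSU procedure differs in detail.

```python
import torch

def project_orthogonal(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # grad:  (d,) flattened gradient of one expert's adapter
    # basis: (d, k) orthonormal columns spanning directions important to
    # earlier tasks for this expert; the update keeps only the component
    # outside that subspace, so it cannot overwrite protected directions.
    return grad - basis @ (basis.T @ grad)

def extend_basis(basis: torch.Tensor, task_grads: torch.Tensor,
                 rank: int) -> torch.Tensor:
    # After a task finishes, fold its dominant new gradient directions into
    # the basis: take the residual outside the current subspace and keep its
    # top singular vectors (exactly orthogonal to the existing columns).
    residual = task_grads - basis @ (basis.T @ task_grads)
    u, _, _ = torch.linalg.svd(residual, full_matrices=False)
    return torch.cat([basis, u[:, :rank]], dim=1)

# Usage: start from an empty per-expert basis, e.g. torch.empty(d, 0).
```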
The experimental setup is robust, evaluating PRISM against sixteen state-of-the-art baselines across two multimodal benchmarks (CoIN-6 and CoIN-Long-10). The results demonstrate significant improvements in average accuracy and backward transfer, with detailed comparisons that highlight the advantages of the proposed method. The paper includes comprehensive analyses of the results, showcasing the effectiveness of PRISM in various scenarios.
The paper provides sufficient implementation details, including the architecture, training protocols, and evaluation metrics. However, the absence of a public code repository or demo URL limits the reproducibility of the results. Future work should consider making the code available to facilitate validation by the research community.
While the proposed method shows promise, the paper does not address the computational overhead associated with maintaining per-expert gradient subspaces, which could be a concern in large-scale applications. Additionally, the evaluation is limited to specific multimodal benchmarks, and further testing on diverse datasets would strengthen the findings.
The implications of this research extend to various applications in federated learning, particularly in scenarios where data privacy is paramount. By enhancing the performance of multimodal continual learning systems, PRISM could contribute to advancements in areas such as personalized AI, healthcare, and collaborative learning environments. The main contribution of this paper is the introduction of PRISM, a novel approach that effectively resolves issues of spurious isolation in federated multimodal continual learning by maintaining orthogonality in gradient subspaces and reinterpreting routing mechanisms. The comprehensive analysis of the methodology, experimental results, and potential applications underscores its significance in advancing the field of federated learning.
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic intent, but it remains underexplored because dance requires simultaneous reasoning over linguistic semantics, musical rhythm, and full-body motion dynamics. We introduce TD-Data, a large-scale open dataset for text-dance retrieval, containing about 4,000 12-second dance clips, 14.6 hours of motion, 22 genres, and annotations from professional dance experts. On top of this dataset, we propose CustomDancer, a multimodal retrieval framework that aligns text with dance through a CLIP-based text encoder, music and motion encoders, and a music-motion blending module. CustomDancer achieves state-of-the-art performance on TD-Data, reaching 10.23% Recall@1 and improving retrieval quality in both quantitative benchmarks and user preference studies.
Primary: South-Central Minzu University
All Institutions: South-Central Minzu University
The main contribution of this paper is the introduction of CustomDancer, a multimodal framework for text-dance retrieval, and the TD-Data dataset, which together advance the state-of-the-art in dance content discovery. The comprehensive methodology, rigorous experimental evaluation, and acknowledgment of limitations underscore the significance of this work in the intersection of machine learning and the performing arts.
The methodology is robust, introducing a novel multimodal retrieval framework (CustomDancer) that effectively combines text, music, and motion through a well-structured architecture. The use of a CLIP-based text encoder alongside dedicated music and motion encoders is innovative, allowing for a more nuanced understanding of dance retrieval. The music-motion blending module is particularly noteworthy as it captures the interaction between music and motion, which is crucial for dance. The construction of the TD-Data dataset with expert annotations adds significant value, providing a solid foundation for training and evaluation.
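A minimal sketch of the blending-and-retrieval idea follows, assuming precomputed text, music, and motion embeddings of equal dimension; the gating and projection layers are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendAndScore(nn.Module):
    """Blends music and motion embeddings into one dance embedding and
    scores it against text embeddings; layers are illustrative stand-ins."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_emb, music_emb, motion_emb):
        g = self.gate(torch.cat([music_emb, motion_emb], dim=-1))
        dance = self.proj(g * music_emb + (1 - g) * motion_emb)
        text = F.normalize(text_emb, dim=-1)
        dance = F.normalize(dance, dim=-1)
        return text @ dance.T        # (n_text, n_dance) cosine similarities

def retrieval_loss(sim: torch.Tensor, temperature: float = 0.07):
    # Symmetric InfoNCE: matched text-dance pairs sit on the diagonal.
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```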
The experiments are comprehensive, utilizing multiple evaluation metrics (Recall@K, Median Rank, Mean Rank) that are appropriate for the task. The comparison with strong baselines demonstrates the effectiveness of CustomDancer, and the user study adds a qualitative dimension to the evaluation, confirming that the model aligns well with human judgments. The ablation studies provide insights into the contributions of different components of the model, reinforcing the importance of temporal modeling and feature fusion.
The paper provides detailed implementation details, including the architecture of the encoders and the training objectives. However, the lack of a publicly available code repository or dataset could hinder reproducibility. Future work should consider releasing the code and dataset to facilitate further research in this area.
The paper acknowledges several limitations, including challenges with specialized terminology, conflicts between visual motion and musical affect, and potential performer bias. These factors can impact retrieval accuracy and user satisfaction. Additionally, the dataset's focus on 3D motion and music may overlook important visual elements like costumes and facial expressions.
The work has the potential to significantly impact the fields of dance education, choreography, and creative recommendation systems. By making dance retrieval more accessible, it can facilitate learning and exploration of diverse dance styles. However, the authors emphasize the need for cultural sensitivity in dataset construction and application, highlighting the importance of preserving the context and community significance of dance styles. The main contribution of this paper is the introduction of CustomDancer, a multimodal framework for text-dance retrieval, and the TD-Data dataset, which together advance the state-of-the-art in dance content discovery. The comprehensive methodology, rigorous experimental evaluation, and acknowledgment of limitations underscore the significance of this work in the intersection of machine learning and the performing arts.
To address the limitations of existing Generative Fixed-Filter Active Noise Control (GFANC) methods, which rely on filter decomposition and recombination and require supervised learning with labeled data, this paper proposes a Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework. Unlike previous approaches that predict combination weights of sub control filters, the proposed method directly generates control filters in an unsupervised manner by integrating the co-processor and real-time controller into a fully differentiable ANC system, where the accumulated error signal is used as the training objective. By abandoning the decomposition--reconstruction process, the proposed design simplifies the control pipeline and avoids error accumulation, while the Transformer architecture effectively captures global and dynamic noise characteristics through its attention mechanism. Numerical simulations on real-recorded noises demonstrate that the proposed method achieves improved noise reduction performance and adaptability to different types of noises compared with the original GFANC framework.
Primary: unknown
All Institutions: unknown
The paper presents a novel Transformer-based framework for active noise control that simplifies the filter generation process and improves adaptability to real-world noise conditions. This work is significant as it combines advanced neural architectures with practical applications in noise cancellation, potentially leading to enhanced performance in diverse acoustic environments.
The proposed Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework represents a significant methodological advancement in active noise control (ANC) by integrating a Transformer architecture for direct control-filter generation. This approach eliminates the need for sub-filter decomposition and recombination, which simplifies the control pipeline and enhances adaptability to varying noise conditions. The unsupervised training paradigm, which relies on minimizing the accumulated residual error, is innovative as it reduces the dependency on labeled data, a common limitation in many machine learning applications. The use of a differentiable ANC system allows for end-to-end training, which is a notable strength of the methodology.
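The differentiable training objective can be sketched as follows, assuming a single-channel setup in which the generated control filter is applied to the reference signal and the residual energy at the error microphone is minimized; the secondary acoustic path is taken as identity here for brevity, whereas a real system would include a fixed, measured path.

```python
import torch
import torch.nn.functional as F

def anc_residual_loss(control_filter: torch.Tensor,
                      reference: torch.Tensor,
                      disturbance: torch.Tensor) -> torch.Tensor:
    """control_filter: (taps,) FIR filter emitted by the generator network
    reference:      (T,) noise at the reference microphone
    disturbance:    (T,) noise at the error microphone
    Secondary path assumed identity for brevity; a real system would
    convolve the anti-noise with a fixed, measured path."""
    x = reference.view(1, 1, -1)
    w = control_filter.flip(0).view(1, 1, -1)   # flip: conv1d is correlation
    anti_noise = F.conv1d(x, w, padding=w.shape[-1] - 1)[..., : x.shape[-1]]
    residual = disturbance.view(1, 1, -1) - anti_noise
    return residual.pow(2).mean()               # accumulated error energy
```

Because every step above is differentiable, gradients flow from the residual energy back into the Transformer that emitted `control_filter`, which is what enables the unsupervised, label-free training the paper describes.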
The experimental setup is robust, utilizing a large synthetic dataset of 83,977 noise samples and evaluating the model's performance on both unseen real-world and synthetic noises. The results indicate that the proposed method outperforms the existing GFANC framework in most real-noise scenarios, demonstrating its practical applicability. However, the performance on synthetic noises is mixed, suggesting that while the model excels in real-world conditions, it may not universally outperform all existing methods across all noise types. The evaluation metrics used, particularly the noise reduction (NR) levels, are appropriate for assessing ANC performance.
The paper provides sufficient detail regarding the model architecture, training parameters, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. Future work could benefit from sharing the implementation details and datasets used for training and testing.
One significant limitation is the reliance on a fixed acoustic path during training and evaluation, which may not generalize well to different acoustic environments without retraining the model. Additionally, the increased complexity of the Transformer-based model, while beneficial for performance, raises concerns about computational efficiency and resource requirements, which could limit its deployment in real-time applications.
The proposed framework has the potential to significantly improve active noise control systems in various applications, including consumer electronics, automotive, and industrial environments. By enhancing adaptability to dynamic noise conditions, this research could lead to more effective noise cancellation solutions, improving user experience and comfort in noisy environments. The implications for real-time processing and deployment in practical scenarios are promising, although further work is needed to address the identified limitations. The paper presents a novel Transformer-based framework for active noise control that simplifies the filter generation process and improves adaptability to real-world noise conditions. This work is significant as it combines advanced neural architectures with practical applications in noise cancellation, potentially leading to enhanced performance in diverse acoustic environments.
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on 5 different tasks across 56 benchmark datasets and find that Alethia significantly outperforms state-of-the-art SFMs, with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
Primary: Reality Defender Inc.
All Institutions: Reality Defender Inc., INRS
The main contribution of this paper is the introduction of Alethia, a foundational encoder for voice deepfakes that significantly enhances detection and localization capabilities through an innovative pretraining methodology. This work addresses critical gaps in existing models and sets a new standard for future research in the domain of audio deepfake detection.
The paper introduces a novel pretraining framework for voice deepfake detection, Alethia, which innovatively combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. This dual-branch approach allows the model to learn robust representations that capture generative artifacts in voice deepfakes, addressing limitations in existing speech foundation models (SFMs) that primarily focus on downstream finetuning. The methodology is well-structured, with a clear explanation of the model architecture, pretraining objectives, and the rationale behind the design choices, such as the use of continuous embeddings instead of discrete tokens.
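A minimal sketch of the two pretraining objectives is shown below, assuming a teacher that provides continuous target embeddings and a user-supplied velocity network for the flow-matching branch; this illustrates the general recipe, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def masked_embedding_loss(pred: torch.Tensor, teacher_emb: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    # Continuous regression to teacher embeddings at masked frames only;
    # pred, teacher_emb: (B, T, D), mask: (B, T) boolean.
    return F.mse_loss(pred[mask], teacher_emb[mask])

def flow_matching_loss(velocity_net, spec: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    # Conditional flow matching toward the clean spectrogram `spec` (B, F, T)
    # along a linear noise-to-data path; `velocity_net` is a user-supplied
    # network taking (x_t, t, cond) and predicting the velocity field.
    noise = torch.randn_like(spec)
    t = torch.rand(spec.size(0), 1, 1, device=spec.device)
    x_t = (1 - t) * noise + t * spec
    target_v = spec - noise            # constant velocity along the path
    return F.mse_loss(velocity_net(x_t, t.view(-1), cond), target_v)
```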
The experimental evaluation is comprehensive, covering five different tasks across 56 benchmark datasets, which is a significant contribution to the field. The results demonstrate that Alethia outperforms existing SFMs in various metrics, including equal error rate (EER) and accuracy, particularly in challenging scenarios. The zero-shot generalization capability to unseen domains, such as singing deepfakes, is a notable strength of the model. However, the paper could benefit from more detailed ablation studies to further validate the contributions of each component in the proposed framework.
The paper provides a thorough description of the experimental setup, including data preprocessing, model architecture, and training procedures. However, the lack of publicly available code or datasets limits reproducibility. Providing a GitHub repository or links to the datasets used would enhance the ability of other researchers to replicate the findings.
One limitation of the study is the reliance on self-curated datasets for pretraining, which may introduce biases or artifacts not present in real-world data. Additionally, while the model shows promising results, its performance on edge cases or highly diverse datasets remains to be fully explored. The paper also does not address potential ethical implications of deepfake technology, which is crucial given the sensitive nature of the application.
The research has significant implications for the field of audio processing and deepfake detection, contributing to the development of more robust systems that can help mitigate the risks associated with the misuse of deepfake technology. As deepfakes become more prevalent, the ability to detect and localize them effectively is crucial for maintaining trust in digital communications. The main contribution of this paper is the introduction of Alethia, a foundational encoder for voice deepfakes that significantly enhances detection and localization capabilities through an innovative pretraining methodology. This work addresses critical gaps in existing models and sets a new standard for future research in the domain of audio deepfake detection.
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, National Center for Supercomputing Applications
The main contribution of this paper is the development of a few-shot accent synthesis pipeline that leverages LLM-guided phoneme editing to improve ASR performance in low-resource settings. This innovative approach not only addresses the challenge of accent adaptation but also demonstrates the effectiveness of combining TTS and ASR technologies to enhance speech recognition across diverse accents.
The proposed methodology effectively combines few-shot learning with LLM-guided phoneme editing to address the challenge of accent adaptation in ASR systems. The approach is innovative in its use of a phoneme-conditioned TTS model and the integration of LLMs for phoneme editing, which allows for accent-specific pronunciation adjustments while maintaining prosodic alignment. The system's architecture is well-defined, and the use of a matched-rate random phoneme baseline provides a strong comparative framework to evaluate the effectiveness of the LLM-guided edits.
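The matched-rate random baseline in particular is easy to make concrete. The sketch below perturbs a phoneme sequence at a fixed rate with uniformly random substitutions, using a toy inventory; the LLM-guided variant would replace the random choice with accent-conditioned edits.

```python
import random

TOY_INVENTORY = ["AA", "AE", "AH", "B", "D", "EH", "IY", "K", "P", "S", "T", "Z"]

def random_phoneme_edit(phonemes, edit_rate=0.1, seed=0):
    """Matched-rate baseline: substitute each phoneme independently with
    probability `edit_rate`, matching the edit rate of the LLM-guided
    pipeline but without any accent-conditioned structure."""
    rng = random.Random(seed)
    edited = []
    for ph in phonemes:
        if rng.random() < edit_rate:
            edited.append(rng.choice([p for p in TOY_INVENTORY if p != ph]))
        else:
            edited.append(ph)
    return edited

print(random_phoneme_edit(["B", "AE", "T", "IY", "Z"], edit_rate=0.4))
```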
The experiments are comprehensive, evaluating the proposed method across multiple accents (Indian and Korean English) and demonstrating significant improvements in WER through synthetic data generation. The paper provides a clear experimental setup, including detailed descriptions of the datasets, evaluation metrics, and results. The findings indicate that the proposed method not only enhances ASR performance in low-resource scenarios but also shows potential for cross-speaker generalization, which is a critical aspect of practical ASR applications.
The paper includes sufficient implementation details, including training configurations, feature extraction methods, and evaluation protocols, which support reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their code and models to enhance reproducibility.
One notable limitation is that the system inherits prosody from the source speech rather than modeling accent-specific prosodic variations, which may restrict the fidelity of the synthesized speech. Additionally, the adaptation is limited to a single reference speaker, which could affect the generalizability of the results across different speakers and accents. Future work should address these limitations by exploring multi-speaker accent generation and explicit prosody modeling.
The research has significant implications for improving ASR systems in diverse linguistic contexts, particularly for underrepresented accents. By enabling effective accent adaptation with minimal data, this work can contribute to more inclusive speech technologies that better serve global populations. The potential applications extend to various domains, including voice assistants, transcription services, and accessibility tools, enhancing communication for speakers of different accents. The main contribution of this paper is the development of a few-shot accent synthesis pipeline that leverages LLM-guided phoneme editing to improve ASR performance in low-resource settings. This innovative approach not only addresses the challenge of accent adaptation but also demonstrates the effectiveness of combining TTS and ASR technologies to enhance speech recognition across diverse accents.
Research on robocall surveillance is hindered by limited access to public datasets, largely owing to privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance research. Robo-SAr comprises ~200 unwanted and ~1200 legitimate synthetic robocall samples across three realistic adversarial axes: psycholinguistics-manipulated transcripts, emotion-eliciting speech, and cloned voices. We further propose RoboKA, a Kolmogorov-Arnold Network (KAN)-based multimodal fusion framework designed to model structured nonlinear interactions between the acoustic and linguistic cues that characterize diverse adversarial robocall strategies. RoboKA first leverages cross-modal contrastive learning to align latent modality representations and feeds the resulting embeddings to a KAN-projection head for final classification. We benchmark RoboKA against strong unimodal and multimodal baselines in both in-domain and out-of-domain setups, finding that RoboKA surpasses all baselines in terms of recall and F1-score.
Primary: Indraprastha Institute of Information Technology Delhi
All Institutions: Indraprastha Institute of Information Technology Delhi, George Mason University
The main contribution of this paper is the introduction of Robo-SAr, a novel adversarial dataset for robocall surveillance, and the development of RoboKA, a KAN-informed multimodal framework that significantly improves the detection of unwanted calls. This work addresses critical gaps in the field by providing a comprehensive approach to modeling the complex interactions between audio and linguistic cues, thereby advancing the state of the art in robocall detection.
The methodology presented in this paper is robust and innovative, leveraging a novel dataset (Robo-SAr) that addresses the limitations of existing datasets in robocall research. The use of Kolmogorov-Arnold Networks (KAN) for multimodal fusion is a significant advancement, as it allows for the modeling of complex nonlinear interactions between audio and text modalities. The cross-modal contrastive learning approach enhances the alignment of representations, which is crucial for effective robocall detection. The authors also provide a clear explanation of their methods and the rationale behind their choices, making the methodology both sound and well-justified.
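Since full KAN layers with learnable splines are involved to reproduce, the sketch below uses a fixed sinusoidal basis with learnable per-edge coefficients as a lightweight stand-in for the KAN-projection head; it illustrates only the structured per-edge nonlinearity idea and is not the paper's implementation.

```python
import torch
import torch.nn as nn

class SineBasisHead(nn.Module):
    """Stand-in for a KAN-projection head: each edge applies a learnable
    combination of fixed sinusoidal basis functions instead of the
    learnable splines of a true KAN. Illustrative only."""
    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 4):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_basis + 1).float())
        self.coeff = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))
        self.skip = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                  # x: (B, in_dim)
        basis = torch.sin(x.unsqueeze(-1) * self.freqs)    # (B, in_dim, n_basis)
        edge = torch.einsum("bik,oik->bo", basis, self.coeff)
        return edge + self.skip(x)

# e.g. classify fused (contrastively aligned) audio+text embeddings of a call:
# head = SineBasisHead(in_dim=2 * 256, out_dim=2)
# logits = head(torch.cat([audio_emb, text_emb], dim=-1))
```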
The experimental evaluation is comprehensive, benchmarking RoboKA against various unimodal and multimodal baselines under different conditions, including in-domain and out-of-domain setups. The results demonstrate a clear performance advantage for RoboKA, particularly in challenging scenarios, which underscores the effectiveness of the proposed approach. The use of human validation for the dataset adds credibility to the findings, although the paper could benefit from more detailed statistical analysis of the results.
The paper commits to releasing the dataset and code upon review, which is a positive step towards ensuring reproducibility. However, the lack of explicit URLs for accessing the dataset and code is a drawback. The methodology is described in sufficient detail to allow for replication, but the absence of a demo or project URL limits immediate accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English language robocalls, which restricts the applicability of the findings to multilingual contexts. Additionally, the reliance on synthetic data raises questions about the generalizability of the results to real-world scenarios. The authors also note that the dataset may not fully capture the complexities of real-world robocalls, which could impact the robustness of the model in practical applications.
The implications of this research are significant, particularly in the context of increasing robocall threats. By providing a robust framework for detecting deceptive robocalls, this work has the potential to enhance consumer protection and inform regulatory efforts. The methodology could also be adapted for other domains where multimodal deception detection is relevant, such as phishing or online scams. The main contribution of this paper is the introduction of Robo-SAr, a novel adversarial dataset for robocall surveillance, and the development of RoboKA, a KAN-informed multimodal framework that significantly improves the detection of unwanted calls. This work addresses critical gaps in the field by providing a comprehensive approach to modeling the complex interactions between audio and linguistic cues, thereby advancing the state of the art in robocall detection.
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2% of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10% of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
Primary: University of Stellenbosch
All Institutions: University of Stellenbosch
The paper presents a pioneering evaluation of elephant call classification using pretrained acoustic embeddings, achieving significant performance without fine-tuning. This work not only advances the field of bioacoustics but also sets a precedent for leveraging existing models in low-data scenarios, thereby enhancing conservation efforts through automated analysis of wildlife vocalizations.
The paper introduces a novel approach to elephant call classification using pretrained acoustic embeddings without fine-tuning, which is significant given the scarcity of annotated bioacoustic data. The methodology is well-structured, employing a variety of embedding models from different domains and evaluating their performance with lightweight classifiers. The choice to analyze intermediate layers of transformer models for their efficacy in classification is particularly innovative, providing insights into the model's internal representations. The segmentation and classification processes are clearly defined, ensuring a robust experimental design.
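The evaluation protocol reduces to a simple recipe: freeze the embedding network, pool its features, and cross-validate a lightweight classifier. A minimal sketch with placeholder features is shown below; in practice the embeddings would come from, for instance, an intermediate wav2vec2.0 or HuBERT layer, mean-pooled over time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder stand-ins: in practice, `embeddings` would be time-pooled
# outputs of a frozen pretrained encoder and `labels` the call annotations.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))
labels = rng.integers(0, 2, size=200)

clf = LogisticRegression(max_iter=1000)   # lightweight downstream classifier
aucs = cross_val_score(clf, embeddings, labels, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```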
The experiments are comprehensive, utilizing two distinct datasets for evaluation, which enhances the validity of the results. The performance metrics, including AUC and MAP, are appropriate for the classification task and allow for a nuanced understanding of model effectiveness. The results demonstrate that the best-performing embedding model, Perch 2.0, achieves competitive performance compared to end-to-end supervised models, highlighting the potential of using out-of-domain embeddings in low-resource settings.
The paper provides sufficient detail regarding the experimental setup, including data segmentation, model configurations, and hyperparameter tuning, which supports reproducibility. However, the lack of publicly available code or datasets limits the ease with which other researchers can replicate the study.
One notable limitation is the reliance on pretrained models that may not be strictly out-of-species, particularly with Perch 2.0, which raises questions about the generalizability of the findings. Additionally, the paper does not address potential biases in the datasets or the implications of using embeddings from models trained on other species.
The implications of this research extend beyond elephant call classification, as it demonstrates the utility of pretrained embeddings in bioacoustics, potentially influencing conservation strategies and wildlife management. The approach could be adapted for other endangered species, promoting the use of machine learning in ecological research and conservation efforts. The paper presents a pioneering evaluation of elephant call classification using pretrained acoustic embeddings, achieving significant performance without fine-tuning. This work not only advances the field of bioacoustics but also sets a precedent for leveraging existing models in low-data scenarios, thereby enhancing conservation efforts through automated analysis of wildlife vocalizations.
Audio-based stuttering systems to date have been trained for detection -- what disfluency is present now -- leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second clips) to predict whether the next contiguous clip contains any disfluency. (1) Severity-selective precursor signal: on the episode-grouped test set, aggregate preblock AUC is modest (0.581 [0.542, 0.619]), but stratifying by upcoming event type reveals concentration on clinically severe events -- blocks 0.601 [0.554, 0.651] and sound repetitions 0.617 [0.567, 0.667] both exclude chance, while fillers (0.45) and word repetitions (0.49) are at chance. The aggregate objective converges to a severity-selective predictor because severe events carry prosodic precursors; fillers do not. (2) Cross-population transfer: without fine-tuning, the same checkpoint applied to 1,024 pediatric Children-Who-Stutter utterances (FluencyBank Teaching) attains AUC 0.674 for detection and 0.655 for prediction; DisfluencySpeech and LibriStutter reach 0.58-0.60 AUC. (3) Deployable on-device: lossless export to CoreML (1.19 MB), ONNX (40 KB), and TFLite. Neural-Engine latency per 3 s window ranges from 0.25 ms (iPhone 17 Pro Max, A19 Pro) to 0.55 ms (iPhone SE 3rd-gen and M1 Max). A 4 Hz streaming simulation uses 0.54% of the real-time budget. Platt calibration reduces test ECE from 0.177 raw to 0.010. Five negative ablations -- output-level Future-Guided Learning, multi-clip GRU, time-axis concatenation, asymmetric focal loss, and direct block-targeted training -- all failed to improve over the vanilla baseline.
Primary: Kozak Technologies Inc
All Institutions: Kozak Technologies Inc
The main contribution of this paper is the development of a predictive model for stuttering events using audio data, demonstrating that a relatively simple CNN can effectively identify clinically severe disfluencies based on prosodic precursors. This work not only advances the understanding of stuttering prediction but also paves the way for practical applications in speech therapy and real-time intervention systems.
The paper employs a convolutional neural network (CNN) architecture specifically designed for predicting stuttering events based on audio input. The methodology is robust, utilizing a well-defined dataset (SEP-28k) and employing a clear training objective that focuses on predicting upcoming disfluencies. The stratification of results by severity of disfluency types is a significant methodological strength, allowing for a nuanced understanding of the model's predictive capabilities. The inclusion of negative ablation studies further strengthens the methodology by demonstrating a thorough exploration of potential improvements that did not yield better results.
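The prediction target itself is a one-line transformation, sketched below: features of clip t are paired with the disfluency label of clip t+1, under the assumption that clips are contiguous within an episode.

```python
import torch

def next_clip_pairs(clip_feats: torch.Tensor, has_disfluency: torch.Tensor):
    """clip_feats:     (N, ...) features of N contiguous 3 s clips, in order
    has_disfluency: (N,) binary 'clip contains any disfluency' annotation
    Pairs clip t with the label of clip t+1, i.e. the next-contiguous-clip
    prediction target; pairs must be built within a single episode so that
    clip t+1 is genuinely contiguous with clip t."""
    return clip_feats[:-1], has_disfluency[1:]
```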
The experiments are well-structured, with a clear focus on both detection and prediction tasks. The use of multiple datasets, including cross-population transfer evaluations, enhances the credibility of the findings. The reported AUC scores provide a quantitative measure of performance, and the stratified analysis reveals important insights into the model's strengths and weaknesses. The deployment metrics, including on-device latency and model size, are particularly relevant for practical applications, showcasing the model's readiness for real-world use.
The paper emphasizes reproducibility by providing access to the training code, label-generation scripts, and the trained model weights. The detailed description of the training process, including hyperparameters and data preprocessing steps, further supports reproducibility. The inclusion of a catalog of negative results is a commendable practice that aids future research by preventing redundant efforts.
The paper acknowledges several limitations, including the single-clip context that may restrict the model's performance and the potential for variability across different speakers and datasets. The lack of fine-tuning on external datasets raises questions about the generalizability of the model's predictions. Additionally, the reliance on a coarse label for upcoming events could be improved with more precise annotations.
The research has significant implications for the field of speech therapy and assistive technologies for individuals who stutter. By enabling predictive capabilities in real-time, the model could facilitate closed-loop interventions that provide timely feedback to users. The deployment of such technology on consumer devices could enhance accessibility and usability for a broader audience, potentially improving the quality of life for many individuals. The main contribution of this paper is the development of a predictive model for stuttering events using audio data, demonstrating that a relatively simple CNN can effectively identify clinically severe disfluencies based on prosodic precursors. This work not only advances the understanding of stuttering prediction but also paves the way for practical applications in speech therapy and real-time intervention systems.
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech, where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily overlapped parallel conversations, followed by clustering of the participants into conversational groups. In this work, we present the BUT system, based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model. To cluster participants into conversation groups, we employ the Qwen3.5-122B LLM to estimate transcript topic similarity, followed by hierarchical agglomerative clustering. On the MCoRec development set, the proposed system achieves 33.7% WER and a clustering F1 score of 0.97, improving over the official baseline by 16.2% WER and 0.15 F1 absolute. On the eval set, our team ranked second, trailing the best system by 0.16% WER and 0.5% F1.
Primary: Brno University of Technology
All Institutions: Brno University of Technology
This paper presents a novel approach to multi-talker ASR by integrating audio-visual cues and leveraging LLMs for clustering, achieving significant improvements over existing methods. The methodology is well-structured, and the results indicate a meaningful contribution to the field, although attention to limitations and reproducibility could enhance its impact further.
The proposed methodology integrates audio-visual cues into a long-context ASR system, leveraging pre-trained models (NVIDIA Parakeet-v2 and AV-HuBERT) effectively. The use of a gated mechanism for fusing audio and visual features is a notable innovation, allowing the model to dynamically adjust its reliance on each modality. The clustering approach, which employs a large language model (LLM) for semantic topic similarity, represents a significant departure from traditional heuristic methods. This combination of techniques is well-justified and demonstrates a thoughtful approach to addressing the challenges of multi-talker ASR.
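The gated fusion idea can be sketched as follows, assuming frame-aligned audio and visual feature streams; layer names and dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedAVFusion(nn.Module):
    """Per-frame sigmoid gate deciding how much projected visual evidence
    to add to the audio stream; names and dimensions are illustrative."""
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, audio_dim)
        self.gate = nn.Sequential(nn.Linear(2 * audio_dim, audio_dim), nn.Sigmoid())

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim), frame-aligned
        v = self.v_proj(visual)
        g = self.gate(torch.cat([audio, v], dim=-1))
        return audio + g * v     # overlapped frames can lean harder on video
```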
The experimental setup is robust, with clear metrics for both transcription (WER) and clustering (F1 score). The authors provide a thorough analysis of their results, showing substantial improvements over the baseline. However, the reliance on synthetic data for training raises questions about the generalizability of the results to real-world scenarios. The evaluation on both development and eval sets, along with comparisons to baseline systems, adds credibility to their findings.
The paper includes sufficient implementation details, including the training regimen, data preprocessing, and the use of specific frameworks (NeMo and DSPy). The availability of the code on GitHub enhances reproducibility, although the authors could provide more detailed instructions for replicating the experiments.
One limitation is the potential domain mismatch between the synthetic training data and the real-world MCoRec dataset, which could affect the model's performance in practical applications. Additionally, while the clustering approach shows promise, its reliance on LLMs may introduce variability based on the model's performance and the quality of the transcripts.
The advancements in multi-talker ASR have significant implications for applications in various fields, including telecommunications, accessibility for the hearing impaired, and human-computer interaction. The integration of visual cues into ASR systems could lead to more robust and accurate transcription services, enhancing communication in noisy environments. This paper presents a novel approach to multi-talker ASR by integrating audio-visual cues and leveraging LLMs for clustering, achieving significant improvements over existing methods. The methodology is well-structured, and the results indicate a meaningful contribution to the field, although attention to limitations and reproducibility could enhance its impact further.
We introduce a toolkit for uncovering spurious correlations between recording characteristics and the target class in speech datasets. Spurious correlations may arise from heterogeneous recording conditions, a common scenario for health-related datasets. When present in both the training and test data, these correlations result in an overestimation of system performance -- a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on detecting the target class using only the non-speech regions of the audio. Better-than-chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
Primary: Instituto de Investigación en Ciencias de la Computación
All Institutions: Instituto de Investigación en Ciencias de la Computación, Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Facultad de Medicina, Centro de Neurociencias Cognitivas, Universidad de Chile, Universidad de San Andrés
The paper introduces a novel toolkit for detecting spurious correlations in speech datasets, addressing a critical issue in machine learning applications. The technical contributions and methodology are well-articulated, providing valuable insights into the reliability of speech-based models, particularly in high-stakes scenarios.
The methodology presented in the paper is robust and well-structured, focusing on the detection of spurious correlations in speech datasets. The authors introduce a systematic approach that leverages non-speech regions of audio to diagnose potential biases in datasets, which is a significant advancement in ensuring the reliability of machine learning models in high-stakes applications. The toolkit's design, which includes careful selection of voice-activity detection systems and feature extraction methods, demonstrates a thorough understanding of the challenges posed by spurious correlations.
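The diagnostic reduces to a simple recipe: mask out speech with a VAD, extract features from what remains, and check whether a classifier beats chance. A crude sketch with hand-rolled features is shown below, assuming each recording retains some non-speech samples; the released toolkit's feature extractors and VAD options are more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def nonspeech_features(audio: np.ndarray, speech_mask: np.ndarray) -> np.ndarray:
    # Keep only samples a VAD marked as non-speech (True = speech) and
    # summarize them with a few crude spectral/energy statistics.
    nonspeech = audio[~speech_mask]
    spectrum = np.abs(np.fft.rfft(nonspeech))
    return np.array([nonspeech.std(), spectrum.mean(), np.median(spectrum)])

def leakage_auc(recordings, speech_masks, targets) -> float:
    # Above-chance AUC flags target-class information leaking through
    # recording conditions rather than through speech content.
    X = np.stack([nonspeech_features(a, m)
                  for a, m in zip(recordings, speech_masks)])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, targets, cv=5, scoring="roc_auc").mean()
```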
The experiments conducted on two Alzheimer's disease speech datasets are comprehensive and well-executed. The authors provide a detailed analysis of the performance of their method against various configurations, including different feature extraction techniques and VAD systems. The use of statistical significance testing adds rigor to their findings, although the reliance on specific datasets may limit generalizability.
The paper offers a clear description of the experimental setup and the toolkit's implementation, which is publicly available on GitHub. This enhances reproducibility, as other researchers can apply the same methods to their datasets. However, the paper could benefit from more detailed instructions on how to utilize the toolkit effectively.
One limitation of the study is the potential overfitting to the specific datasets used for evaluation, which may not represent the broader spectrum of speech datasets. Additionally, while the toolkit addresses spurious correlations, it does not provide solutions for all possible biases that may arise in speech data collection.
The implications of this research are significant, particularly in the context of health-related machine learning applications where spurious correlations can lead to harmful consequences. The toolkit can serve as a critical resource for researchers and practitioners in the field, promoting more reliable and ethical use of speech datasets in machine learning. The paper introduces a novel toolkit for detecting spurious correlations in speech datasets, addressing a critical issue in machine learning applications. The technical contributions and methodology are well-articulated, providing valuable insights into the reliability of speech-based models, particularly in high-stakes scenarios.
Cross-lingual speaker verification suffers from severe language-speaker entanglement. This causes systematic degradation in the hardest scenario: correctly accepting utterances from the same speaker across different languages while rejecting those from different speakers sharing the same language. Standard adversarial disentanglement degrades speaker discriminability; blind discriminators inadvertently penalize speaker-discriminative traits that merely correlate with language. To address this, we propose Dual-LoRA, injecting trainable task-factorized LoRA adapters into a frozen pre-trained backbone. Our core innovation is a Language-Anchored Adversary: by grounding the discriminator with an explicit language branch, adversarial gradients target true linguistic cues rather than arbitrary correlations, preserving essential speaker characteristics. Evaluated on the TidyVoice benchmark, our system achieves a 0.91% validation EER and achieves 3rd place in the official challenge.
Primary: Nanjing University
All Institutions: Nanjing University, AISpeech Co, Jiangsu Key Lab of Language Computing, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Soul AI Lab
The paper presents Dual-LoRA, an innovative framework for cross-lingual speaker verification that effectively disentangles language and speaker identity, achieving notable performance improvements on benchmark evaluations. The comprehensive methodology and rigorous experimental validation contribute significantly to the field, addressing a critical challenge in speaker verification systems.
The methodology presented in the paper is innovative, particularly in its use of Dual-LoRA, which introduces a parameter-efficient approach to disentangle language and speaker identity in cross-lingual speaker verification. The architecture's design, which incorporates two parallel LoRA streams and a Language-Anchored Adversary, is well-justified and addresses key challenges in the field. The decision to keep the backbone frozen while adapting only the LoRA modules is a strategic choice that enhances the model's efficiency and effectiveness.
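The adapter arrangement can be sketched as a frozen linear layer with two parallel low-rank branches, as below; which branch ends up carrying speaker versus language information is shaped by the training losses (including the language-anchored adversary), not by the module itself.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen linear layer with two parallel rank-r adapters. The speaker/
    language split is imposed by the training objectives; the branch names
    here are purely descriptive."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # backbone stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.spk_down = nn.Linear(d_in, rank, bias=False)
        self.spk_up = nn.Linear(rank, d_out, bias=False)
        self.lang_down = nn.Linear(d_in, rank, bias=False)
        self.lang_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.spk_up.weight)          # adapters start as no-ops
        nn.init.zeros_(self.lang_up.weight)

    def forward(self, x):
        return (self.base(x)
                + self.spk_up(self.spk_down(x))
                + self.lang_up(self.lang_down(x)))
```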
The experiments conducted on the TidyVoice benchmark are robust, with a clear focus on evaluating the proposed framework against established baselines. The use of multiple backbones and the systematic analysis of different configurations provide strong evidence for the effectiveness of the Dual-LoRA approach. The reported results, including the significant reduction in EER, particularly in challenging scenarios, underscore the practical impact of the proposed method.
The paper provides sufficient implementation details, including the architecture, training procedures, and hyperparameters, which facilitate reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for others to replicate the results independently.
One notable limitation is the reliance on a single benchmark dataset (TidyVoice) for evaluation, which may not fully capture the generalizability of the proposed method across diverse real-world scenarios. Additionally, while the paper addresses the issue of language-speaker entanglement, it does not explore potential biases that may arise from the training data or the implications of using specific languages.
The proposed Dual-LoRA framework has the potential to significantly enhance cross-lingual speaker verification systems, making them more effective for applications in voice authentication and personalization across different languages. This advancement could lead to broader adoption of voice-based technologies in multilingual contexts, improving accessibility and user experience. The paper presents Dual-LoRA, an innovative framework for cross-lingual speaker verification that effectively disentangles language and speaker identity, achieving notable performance improvements on benchmark evaluations. The comprehensive methodology and rigorous experimental validation contribute significantly to the field, addressing a critical challenge in speaker verification systems.
Conventional neural speech codecs suffer from severe intelligibility degradation at ultra-low bitrates, where the bottleneck transitions from acoustic distortion to semantic loss. To address this issue, this paper conducts a systematic investigation into the role and fundamental limits of integrating frozen semantic priors -- specifically HuBERT and Whisper -- into neural speech coding. We introduce and quantitatively validate a novel Semantic Retirement phenomenon: while semantic constraints reduce the Word Error Rate (WER) by up to ~10% relative at 1.5 kbps, their benefits rapidly diminish beyond 6 kbps, indicating a practical capacity boundary. We further uncover a clear trade-off between different prior types: acoustic-rich priors (HuBERT) better preserve prosodic and timbral details, whereas high-level linguistic priors (Whisper) effectively suppress phonetic hallucinations in noisy environments (reducing hallucination rates by 26%) and substantially narrow the generalization gap for unseen speakers. Building on these findings, we propose a bitrate-aware regulation strategy that dynamically adjusts prior strength to optimize the trade-off between semantic consistency and perceptual naturalness. Extensive experimental evaluations confirm that our approach achieves competitive intelligibility and noise robustness compared to existing baselines, offering a principled pathway toward ultra-low-bitrate generative speech coding.
Primary: Tsinghua Shenzhen International Graduate School, Tsinghua University
All Institutions: Tsinghua Shenzhen International Graduate School, Tsinghua University, Tencent
This paper presents a comprehensive analysis of the role of semantic priors in neural speech coding, introducing a novel framework that enhances intelligibility and robustness at ultra-low bitrates. The innovative methodology and thorough experimental evaluation contribute significantly to the field of audio processing, addressing a critical challenge in speech codec design.
The methodology presented in this paper is robust and well-structured. The authors propose a novel framework that integrates frozen semantic priors (HuBERT and Whisper) into a neural speech codec, addressing the challenges of intelligibility degradation at ultra-low bitrates. The introduction of the "Semantic Retirement" phenomenon is a significant contribution, as it quantitatively defines the limits of semantic guidance in speech coding. The bitrate-aware regulation strategy is particularly innovative, allowing the model to dynamically adjust the strength of semantic constraints based on the bitrate, which is a practical approach to optimize performance across varying conditions.
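The regulation strategy amounts to a bitrate-dependent weight on the semantic-prior loss term. A toy schedule consistent with the reported numbers (full strength at 1.5 kbps, retired by roughly 6 kbps) is sketched below; the functional form is an assumption, not the paper's.

```python
def semantic_weight(bitrate_kbps: float, w_max: float = 1.0,
                    knee: float = 1.5, retire: float = 6.0) -> float:
    """Toy schedule for the semantic-prior loss weight: full strength at
    ultra-low bitrates, linearly decaying to zero by the ~6 kbps point
    where the paper reports the priors' benefit vanishing."""
    if bitrate_kbps <= knee:
        return w_max
    if bitrate_kbps >= retire:
        return 0.0
    return w_max * (retire - bitrate_kbps) / (retire - knee)

# The total training loss would then take the form:
#   loss = recon_loss + semantic_weight(bitrate) * semantic_consistency_loss
```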
The experimental evaluation is extensive and well-executed, utilizing the LibriSpeech dataset to validate the proposed framework. The authors provide a thorough analysis of the performance metrics, including Word Error Rate (WER), Perceptual Evaluation of Speech Quality (PESQ), and robustness against noise. The results convincingly demonstrate the effectiveness of the proposed method in improving intelligibility and reducing hallucination rates, particularly in low-bitrate scenarios. The ablation studies further strengthen the findings by isolating the effects of different semantic priors and the regulation strategy.
The paper includes sufficient implementation details, such as the architecture of the neural codec, the configuration of the Residual Vector Quantization, and the training setup. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Providing access to the models and datasets used would enhance the ability of other researchers to replicate and build upon this work.
One limitation is the reliance on frozen semantic priors, which may not capture the full range of acoustic nuances needed for optimal performance in all scenarios. Additionally, the paper primarily focuses on two specific priors (HuBERT and Whisper), which may limit the generalizability of the findings to other types of semantic guidance. The authors also acknowledge the potential for over-smoothing at higher bitrates, which could affect the naturalness of the output.
The findings of this research have significant implications for the development of efficient speech coding systems, particularly in applications where bandwidth is severely limited, such as mobile communications and low-bitrate streaming services. The insights gained from the "Semantic Retirement" phenomenon could inform future research on codec design and the integration of semantic information into other audio processing tasks. The approach could also pave the way for advancements in speech synthesis and recognition systems that require high intelligibility in challenging acoustic environments. This paper presents a comprehensive analysis of the role of semantic priors in neural speech coding, introducing a novel framework that enhances intelligibility and robustness at ultra-low bitrates. The innovative methodology and thorough experimental evaluation contribute significantly to the field of audio processing, addressing a critical challenge in speech codec design.
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
Primary: Southern University of Science and Technology
All Institutions: Southern University of Science and Technology, Tencent Youtu Lab
The main contribution of this paper is the development of a robust framework for Audio Deepfake Detection that leverages hard sample classification and diffusion-based reconstruction to enhance generalization against unseen attacks. This work represents a meaningful advancement in the field of audio deepfake detection, addressing critical challenges posed by evolving generative models.
The paper proposes a novel framework for Audio Deepfake Detection (ADD) that emphasizes hard sample classification and utilizes diffusion-based reconstruction methods. The integration of multi-layer feature aggregation and the introduction of Regularization-Assisted Contrastive Learning (RACL) are significant contributions that enhance the model's generalization capabilities. The methodology is well-structured, with clear explanations of the reconstruction paradigms and loss functions employed. However, while the approach is innovative, it builds on existing concepts in contrastive learning and reconstruction, which slightly limits its novelty.
The experiments are comprehensive, evaluating the proposed methods across multiple datasets, including ASVspoof and CodecFake. The results demonstrate a significant reduction in the average Equal Error Rate (EER) compared to baseline models, showcasing the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of different components of the methodology, reinforcing the validity of the findings. However, the paper could benefit from a more detailed analysis of potential edge cases or scenarios where the model may underperform.
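For readers unfamiliar with the headline metric, the Equal Error Rate is the operating point where the false-accept and false-reject rates coincide. A standard way of computing it from detector scores (independent of this paper's models) looks like:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER: the threshold at which false-accept and false-reject rates meet.

    `labels` are 1 for bona fide and 0 for spoofed audio; `scores` are
    higher-is-more-bonafide detector outputs.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy usage: a detector that separates the two classes perfectly.
scores = np.array([0.9, 0.8, 0.75, 0.3, 0.4, 0.2])
labels = np.array([1, 1, 1, 0, 0, 0])
print(equal_error_rate(scores, labels))  # 0.0 on this separable toy data
```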
The implementation details are sufficiently thorough, covering data preprocessing, model architecture, and training parameters, which enhances reproducibility. However, the absence of a publicly available code repository or demo limits other researchers' ability to replicate the results directly.
One limitation is the reliance on specific reconstruction methods, which may not generalize well across all types of audio deepfakes. Additionally, the performance on certain datasets showed minor degradation, suggesting that the model may prioritize generalization over specific artifacts. The paper could also discuss potential biases in the datasets used for training and evaluation.
The implications of this research are significant, particularly in the context of security and misinformation, as robust audio deepfake detection systems are crucial for maintaining trust in audio communications. The proposed framework could be applied in various domains, including cybersecurity, media verification, and social media platforms, where audio authenticity is paramount.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces prove unsuitable for zero-shot similarity evaluation. Representational limitations allow linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals that the metric rewards acoustic mimicry over genuine emotional synthesis.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California
The paper critically examines the limitations of the emotion similarity metric EMO-SIM in evaluating emotional expressiveness in speech generation, revealing its misalignment with human perception and robustness issues. This comprehensive analysis challenges existing methodologies and underscores the need for improved evaluation frameworks in the field.
The paper employs a systematic approach to evaluate the limitations of the widely adopted EMO-SIM metric for emotional expressiveness in speech generation. It rigorously tests the metric against three criteria: categorical emotion robustness, dimensional emotion sensitivity, and human perception alignment. The methodology includes adversarial sampling, calibration of latent spaces, and a comprehensive evaluation against human judgments, which is a significant strength. However, the lack of a clear new metric or framework to replace EMO-SIM is a notable gap.
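The metric under scrutiny is simple to state: cosine similarity between emotion embeddings of the reference and generated utterances. A sketch, assuming an embedding extractor such as emotion2vec is available elsewhere:

```python
import numpy as np

def emo_sim(ref_emb, gen_emb):
    """Cosine similarity between the emotion embeddings of reference and
    generated speech -- the EMO-SIM-style score the paper critiques."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return float(ref @ gen)
```

The paper's central finding is precisely that this single scalar conflates speaker and linguistic similarity with affect.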
The experiments are well-designed, utilizing diverse datasets and multiple evaluation scenarios to assess the performance of EMO-SIM. The results consistently demonstrate the metric's inadequacy in capturing genuine emotional expressiveness, particularly under various acoustic and linguistic distractors. The statistical analyses, including Spearman's correlation and triplet accuracy, provide robust evidence of the findings. However, the paper could benefit from additional comparisons with existing metrics to contextualize its claims further.
The paper provides sufficient detail on the experimental setup, including dataset preparation and evaluation criteria, which aids reproducibility. However, the absence of publicly available code or datasets limits other researchers' ability to replicate the findings fully.
The primary limitation is the lack of a proposed alternative metric to EMO-SIM, which leaves a gap in practical applicability. Additionally, the focus on a single metric may overlook other potential evaluation frameworks that could be more effective. The experiments also rely heavily on subjective human evaluations, which may introduce variability.
This work has significant implications for the development of more reliable metrics in speech synthesis and emotional voice conversion, which are critical for applications in human-computer interaction, entertainment, and accessibility technologies. By highlighting the deficiencies of current evaluation methods, it encourages the community to pursue more accurate and meaningful metrics for emotional expressiveness in generated speech.
Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net fully preserves the structural distinction between context and utterance during encoding, effectively abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and seamlessly integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Extensive experiments on mainstream benchmark datasets fully demonstrate the effectiveness of the proposed method.
Primary: Zhejiang University
All Institutions: Nanjing University, Zhejiang University
The main contribution of this paper is the introduction of CUCI-Net, a novel framework for conversational multimodal understanding that effectively preserves the context-utterance structure and utilizes an interpretation cue to guide multimodal reasoning, leading to improved performance in sarcasm detection tasks. This work significantly advances the state of the art in multimodal dialogue understanding by addressing key limitations in existing methodologies.
The proposed CUCI-Net introduces a three-stage framework that emphasizes the preservation of context-utterance structure, the abstraction of context-utterance dependencies into an interpretation cue, and the integration of this cue into multimodal reasoning. This methodology is innovative as it directly addresses the limitations of existing models that often overlook the explicit context-utterance relationship in multimodal dialogue understanding. The use of dual-expert encoders and the structured approach to cue-guided interaction represent a significant advancement in the field.
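As a rough structural illustration (not the authors' architecture), cue-conditioned fusion can be reduced to a gate computed from pooled context features and applied to the fused utterance representation; all layer shapes below are invented for the example.

```python
import torch
import torch.nn as nn

class CueGatedFusion(nn.Module):
    """Toy illustration of cue-conditioned multimodal fusion.

    The real CUCI-Net design is more involved; here the "interpretation
    cue" is just a projection of pooled context features, used to gate
    the utterance's fused text/audio/visual features before prediction.
    """
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.cue_proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, context_pooled, utterance_fused):
        cue = torch.tanh(self.cue_proj(context_pooled))           # abstracted dependency
        g = self.gate(torch.cat([cue, utterance_fused], dim=-1))  # context-conditioned gate
        return self.classifier(g * utterance_fused)
```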
The experiments conducted on the MUStARD and MUStARD++ datasets demonstrate the effectiveness of CUCI-Net, achieving superior performance compared to various strong baselines. The results are rigorously reported, with metrics such as Precision, Recall, and F1-score, and the ablation studies provide clear insights into the contributions of each component of the model. This thorough evaluation strengthens the claims made regarding the model's effectiveness.
The paper provides detailed implementation details, including architecture specifications, optimization settings, and feature extraction methods. However, the absence of a public code repository or demo URL limits the reproducibility of the results, as others cannot easily replicate the experiments or validate the findings independently.
One notable limitation is the reliance on specific datasets (MUStARD and MUStARD++) that may not fully represent the diversity of conversational contexts in real-world applications. Additionally, while the model excels in sarcasm detection, its performance on other forms of non-literal expressions or more complex conversational dynamics remains to be thoroughly evaluated.
The advancements presented in CUCI-Net have potential applications in various domains, including conversational AI, sentiment analysis, and multimodal interaction systems. By improving context-dependent understanding in dialogue systems, this research can enhance user experiences in virtual assistants, customer service bots, and social robots, contributing to more natural and effective human-computer interactions.
Multimodal Sentiment Analysis (MSA) requires integrating language, acoustic, and visual signals without sacrificing modality-specific sentiment evidence. Existing methods mainly improve either shared-private decomposition or cross-modal interaction. Although effective, both ultimately depend on how shared and modality-specific evidence is organized before prediction. We observe that, under standard shared-private pipelines, modality heterogeneity often induces a branch-imbalance process: dominant shared patterns accumulate in the shared branch, yielding redundant and modality-biased evidence, while repeated interaction and rigid alignment gradually leak shared information into modality-specific channels and weaken discriminative private representations. As a result, the complementarity between shared and private representations is reduced, limiting robust sentiment reasoning. To address this issue, we propose the Dual-Branch Rebalancing Framework (DBR) on top of a standard multimodal decoupling stage. In the shared branch, a Temporal-Structural Factorization (TSF) module disentangles temporal evolution from structural dependencies and adaptively integrates them to reduce shared redundancy. In the private branch, an Anchor-Guided Private Routing (AGPR) module preserves discriminative modality-specific patterns while allowing controlled cross-modal borrowing. A Bidirectional Rebalancing Fusion (BRF) module then reunifies the two regularized branches in a context-aware manner for final prediction. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that DBR consistently outperforms the compared baselines. Further analyses show that these improvements come from coordinated mitigation of branch imbalance.
Primary: Fudan University
All Institutions: China University of Petroleum-Beijing at Karamay, Fudan University, Peking University, University of Southern California, University of Macau
The paper presents a comprehensive framework for addressing shared-private branch imbalance in multimodal sentiment analysis, contributing valuable insights and methodologies to the field. The innovative approach and rigorous experimental validation position this work as a significant advancement in multimodal representation learning.
The proposed Dual-Branch Rebalancing Framework (DBR) introduces a novel approach to mitigating shared-private branch imbalance in multimodal sentiment analysis. The methodology is well-structured, comprising three main components: Temporal-Structural Factorization (TSF) to disentangle shared representations, Anchor-Guided Private Routing (AGPR) to maintain modality-specific features, and Bidirectional Rebalancing Fusion (BRF) for effective integration. This coordinated design addresses the inherent challenges of modality heterogeneity and redundancy, showcasing a clear understanding of the complexities involved in multimodal representation learning.
The experimental evaluation is robust, utilizing multiple widely recognized benchmarks (CMU-MOSI, CMU-MOSEI, and MIntRec) to validate the effectiveness of DBR. The results demonstrate significant improvements over state-of-the-art baselines across various metrics, indicating the proposed framework's strong performance. The ablation studies further substantiate the contributions of each module, providing insights into their individual impacts on overall performance.
The paper provides sufficient implementation details, including the use of PyTorch, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository or demo limits the practical reproducibility of the results.
While the framework shows promising results, the paper does not address potential limitations such as the scalability of the model to larger datasets or the computational efficiency of the proposed modules. Additionally, the reliance on specific benchmarks may not fully capture the generalizability of the approach across diverse multimodal tasks.
The findings of this research have significant implications for the field of multimodal sentiment analysis, particularly in applications involving human-centered AI systems. By improving the integration of diverse modalities, the proposed framework can enhance the robustness of sentiment prediction in real-world scenarios, potentially benefiting areas like social media analysis, customer feedback interpretation, and emotional AI.
Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
Primary: Shaggar Institute of Technology
All Institutions: Shaggar Institute of Technology, Trinity College Dublin
The paper presents a novel approach to cross-lingual voice cloning, demonstrating significant advancements in intelligibility and speaker similarity while addressing the challenges of data scarcity in specialized domains. The methodology and results contribute meaningfully to the field of spoken language technology, particularly in the context of scientific communication.
The paper employs a robust methodology by leveraging ensemble distillation from multiple state-of-the-art voice cloning models to generate high-fidelity synthetic datasets for fine-tuning. The use of Parameter-Efficient Fine-Tuning (LoRA) is particularly noteworthy, allowing the authors to adapt a large foundation model to specific languages while preserving speaker identity. The approach is well-structured, with clear delineation of the data processing, training configuration, and inference pipeline.
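A minimal sketch of such a LoRA setup using the Hugging Face peft library; `base_model` is a placeholder for the OmniVoice backbone loaded elsewhere, and the rank and `target_modules` names are assumptions, since the paper's exact configuration is not reproduced here.

```python
from peft import LoraConfig, get_peft_model

# `base_model` is a placeholder for the foundation TTS model loaded elsewhere.
lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension, assumed
    lora_alpha=32,                        # scaling factor, assumed
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, assumed
)
model = get_peft_model(base_model, lora_cfg)  # wraps the frozen base
model.print_trainable_parameters()            # only the adapter weights train
```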
The experiments are comprehensive, utilizing a well-defined dataset (ACL 60/60) and a variety of evaluation metrics (WER, CER, and speaker similarity). The results demonstrate consistent improvements in intelligibility and speaker similarity across the three target languages, validating the effectiveness of the proposed methods. The comparative analysis with existing models further strengthens the findings.
The authors provide a public code repository that includes details on data preparation, training, and evaluation, enhancing the reproducibility of their work. However, the limited scale of the dataset may pose challenges for others attempting to replicate the results at a larger scale.
The study is constrained by the size of the distilled training dataset (1,404 samples), which may limit the generalizability of the findings. Additionally, the reliance on automated metrics for evaluation may not fully capture the perceptual quality of synthesized speech, and the paper acknowledges the risks associated with voice cloning technology.
This research has significant implications for enhancing accessibility in scientific communication across different languages, potentially democratizing knowledge dissemination. However, the ethical considerations surrounding voice cloning technologies, such as the potential for misuse, underscore the need for responsible deployment and robust safeguards.
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents a novel accent evaluation benchmark for Indic TTS systems, offering a detailed and interpretable framework that enhances the understanding of accent fidelity in synthesized speech. The innovative methodology and significant findings position this work as a valuable contribution to the field of machine learning and speech synthesis.
The paper proposes a novel framework, the Phoneme Substitution Profile (PSP), which quantitatively evaluates accent fidelity in Indic languages for TTS systems. The methodology is robust, utilizing a combination of acoustic probes and distributional metrics to capture phonological dimensions of accent. The use of Wav2Vec2 embeddings for forced alignment and the construction of native speaker centroids are particularly innovative, allowing for a detailed analysis of accent features that are often overlooked in traditional TTS evaluations. The six dimensions of accent fidelity (RR, AF, LF, ZF, FAD, PSD) provide a comprehensive approach to understanding TTS performance across different systems and languages.
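One plausible reading of the centroid-probe scores, sketched for the retroflex collapse rate (RR): classify each intended-retroflex phone by whichever native-speaker centroid its embedding is closer to. The distance measure and aggregation rule here are assumptions, not the released scoring code.

```python
import numpy as np

def retroflex_collapse_rate(phone_embs, retro_centroid, dental_centroid):
    """Fraction of intended-retroflex phones whose embedding lies closer
    (by cosine similarity) to the native dental centroid than to the
    retroflex centroid. Embeddings would be, e.g., Wav2Vec2 layer-9 means
    over forced-aligned spans; centroids come from native reference clips.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    collapsed = [cos(e, dental_centroid) > cos(e, retro_centroid) for e in phone_embs]
    return float(np.mean(collapsed))
```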
The experiments benchmark four commercial and open-source TTS systems across Hindi, Telugu, and Tamil, showcasing the effectiveness of the PSP framework. The findings reveal significant insights into the performance of these systems, particularly the divergence between traditional intelligibility metrics (WER) and the proposed accent fidelity metrics. The detailed analysis of results across different languages highlights the varying challenges posed by phonological complexity, making the evaluation both thorough and insightful.
The authors have made a commendable effort to ensure reproducibility by releasing the scoring code and native speaker centroids under open-source licenses. However, the reliance on specific aligners and the current limitations in the quality of these tools may affect the reproducibility of results, particularly for Telugu and Tamil. Future versions of the benchmark are expected to address these issues, enhancing the overall reproducibility.
The paper acknowledges several limitations, including the dependency on forced alignment accuracy, which varies by language, and the potential noise floor in per-phoneme scores. The authors also note that the current version of the PSP does not include formal MOS calibration, which is essential for validating the proposed metrics against human judgment. Additionally, the limited size of pilot sets may affect the statistical significance of some findings.
The PSP framework has the potential to significantly impact the development of TTS systems for Indic languages, providing a much-needed tool for developers to optimize accent fidelity. By focusing on specific phonological features, the framework can help improve the naturalness and intelligibility of synthesized speech, making it more accessible to native speakers. This work also opens avenues for further research into accent evaluation in other languages and dialects, contributing to the broader field of speech synthesis.
Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.
Primary: StepFun
All Institutions: StepFun, Nanyang Technological University, University of New South Wales, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of Step-Audio-R1.5, a novel audio reasoning model that integrates RLHF to enhance the quality of multi-turn dialogues, addressing the limitations of existing models that prioritize isolated correctness over conversational naturalness. This work represents a significant step forward in developing more empathetic and engaging audio interaction systems, setting a new standard for future research in audio language models.
The methodology is robust and innovative, introducing a new paradigm in audio language models by integrating Reinforcement Learning from Human Feedback (RLHF) to address the limitations of Reinforcement Learning with Verified Rewards (RLVR). The paper effectively outlines a structured approach that includes a mid-training stage, cold-start supervised fine-tuning, and a novel reward model that captures both explicit and implicit quality metrics. This combination is significant as it aims to enhance the naturalness and emotional engagement of audio interactions, which is a critical aspect often overlooked in traditional models.
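The reward-model component can be grounded with the standard Bradley-Terry preference objective used across RLHF pipelines. This is a generic sketch of that objective; the paper's reward model additionally folds in implicit acoustic-quality signals not modeled here.

```python
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry objective for RLHF reward-model training: push the
    scalar reward of the human-preferred response above the rejected one.
    A generic sketch, not Step-Audio-R1.5's exact reward formulation."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```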
The experimental evaluation is comprehensive, utilizing multiple benchmarks, including the newly proposed AudioMultiChallenge and Step-Caption, which are well-designed to assess various dimensions of audio reasoning and dialogue quality. The results indicate that Step-Audio-R1.5 performs competitively against leading models, demonstrating significant improvements in multi-turn dialogue scenarios. The use of diverse datasets and rigorous evaluation metrics strengthens the findings.
The paper provides a clear description of the architecture and training process, which aids in reproducibility. However, it lacks detailed implementation specifics such as hyperparameters and training duration, which are essential for fully replicating the experiments. The availability of the project URL is a positive aspect, as it may contain additional resources for implementation.
One limitation is the potential over-reliance on human feedback, which may introduce biases based on the evaluators' preferences. Additionally, while the model shows improvements in conversational quality, the paper does not extensively discuss how it handles edge cases or unexpected user inputs, which are common in real-world applications.
The proposed model has the potential to significantly advance the field of audio language processing by improving user interactions in conversational AI systems. This could lead to more engaging and emotionally aware audio applications in various domains, including virtual assistants, customer service, and entertainment.
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
Primary: Central Conservatory of Music
All Institutions: Frontier Institute of Science and Technology, Central Conservatory of Music, Department of AI Music and Music Information Technology, Shenzhen University, Interdisciplinary Research Center
The main contribution of this paper is the development of SymphonyGen, a novel 3D hierarchical framework for orchestral music generation that effectively addresses the complexities of high-level structural form and dense orchestration. This work represents a substantial advancement in the field of AI music generation, combining innovative methodologies with rigorous evaluation to produce a system that aligns closely with modern acoustic expectations.
The paper introduces a 3D hierarchical architecture that effectively manages the complexities of orchestral music generation by decomposing the task into Bar, Track, and Event levels. This cascading decoder architecture enhances computational efficiency and scalability, which is a significant improvement over conventional models. The introduction of a "short-score" conditioning via a beat-quantized multi-voice harmony skeleton is innovative, allowing for greater control over the generated music while maintaining textural diversity. The use of Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward is a novel approach that aligns the generated symbolic output with acoustic expectations, addressing the limitations of previous models. The dissonance-averse sampling algorithm further refines the output by suppressing unintended tonal clashes, showcasing a thoughtful integration of music theory into the generative process.
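The dissonance-averse sampler can be imagined as a logit filter applied before each decoding step. The sketch below penalises note tokens that form a minor second (1 or 11 semitones mod octave) with any currently sounding pitch; the token-to-pitch map and penalty value are illustrative, not the paper's exact rule.

```python
import torch

def dissonance_averse_sample(logits, active_pitches, note_token_pitch, penalty=4.0):
    """Bias decoding away from tonal clashes: down-weight note tokens that
    clash with any sounding pitch, then sample from the adjusted distribution.
    `note_token_pitch` maps token ids to MIDI pitch numbers (assumed)."""
    logits = logits.clone()
    for tok, pitch in note_token_pitch.items():
        if any(abs(pitch - p) % 12 in (1, 11) for p in active_pitches):
            logits[tok] -= penalty
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()
```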
The experimental design is robust, featuring both objective and subjective evaluations. The use of a large dataset (SymphonyNet) for training and validation ensures that the model is well-tested across various orchestral styles. Objective metrics such as harmony precision, recall, and dissonance scores provide quantitative assessments of the model's performance, while subjective evaluations involving listener preferences add qualitative insights. The results indicate that SymphonyGen outperforms baseline models in terms of musicality and preference, particularly among general listeners, which is a strong endorsement of its effectiveness.
The paper provides detailed implementation information, including architecture specifications, training procedures, and evaluation metrics. However, the absence of a publicly available code repository limits reproducibility. The authors mention that implementation details will be available in their codebase, but without immediate access, it is challenging to fully assess reproducibility.
The paper acknowledges some limitations, such as the potential for "strange" harmonies or "noisy" segments in the generated music, which may stem from errors in harmony skeleton generation. Additionally, the subjective evaluations indicate that while the model performs well, it may still produce overly full orchestrations at times, suggesting room for improvement in balancing orchestration richness with clarity.
SymphonyGen has significant implications for the field of AI-assisted music composition, particularly in cinematic orchestration. By providing a controllable framework for composers, it enhances the collaborative potential between human creativity and AI-generated music. The model's ability to produce high-quality orchestral compositions could benefit various applications, including film scoring, video game music, and other multimedia projects, ultimately enriching the landscape of contemporary music creation.
The joint training of speech enhancement and speaker embedding networks for speaker recognition is widely adopted in noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech enhancement pre-training. Moreover, maintaining the speaker information in the denoised speech is not an explicit objective of the speech enhancement process. To address these limitations, we propose a scalable UNet-based Fusion framework (UF-EMA) that treats the noisy and enhanced speech as a multi-channel input, thereby enabling the speaker encoder to exploit speaker information effectively. In addition, an Exponential Moving Average strategy is applied to a speaker encoder pre-trained on clean speech to mitigate overfitting and facilitate a smooth transition from clean to noisy conditions. Experimental results on multiple noise-contaminated test sets showcase the superiority of the proposed approach.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Centre for Speech Technology Research, The University of Edinburgh, The University of Hong Kong
This paper presents a novel approach to speaker recognition in noisy environments by integrating a UNet-based fusion framework with an Exponential Moving Average strategy for speaker encoder adaptation. The technical contributions are well-founded and address critical challenges in the field, showcasing the potential for improved performance in practical applications.
The proposed methodology introduces a UNet-based fusion framework (UF-EMA) that effectively integrates noisy and enhanced speech signals to improve speaker recognition performance in noisy environments. The use of multi-channel input allows the speaker encoder to retain speaker-specific information, which is often lost in traditional approaches. The incorporation of an Exponential Moving Average strategy for updating the speaker encoder is a novel approach that addresses the challenges of overfitting and adaptation to varying noise conditions. The methodology is well-structured and provides a clear rationale for the design choices made, supported by a comprehensive theoretical background.
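The EMA step itself is compact; a sketch of the update, with the momentum value assumed rather than taken from the paper:

```python
import torch

@torch.no_grad()
def ema_update(ema_encoder, live_encoder, momentum=0.999):
    """Exponential moving average of speaker-encoder weights:
    theta_ema <- m * theta_ema + (1 - m) * theta_live.
    Slows drift away from the clean-speech pre-training during the
    clean-to-noisy transition (momentum value assumed)."""
    for p_ema, p_live in zip(ema_encoder.parameters(), live_encoder.parameters()):
        p_ema.mul_(momentum).add_(p_live, alpha=1 - momentum)
```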
The experimental evaluation is robust, utilizing multiple noise-contaminated test sets to validate the proposed method's effectiveness. The results demonstrate a significant improvement in performance compared to existing methods, with lower Equal Error Rates (EER) across various conditions. The ablation studies provide insights into the contributions of individual components, reinforcing the effectiveness of the proposed fusion and EMA strategies. However, the paper could benefit from additional qualitative assessments, such as subjective listening tests, to complement the quantitative metrics.
The paper provides a detailed description of the experimental setup, including the datasets used (VoxCeleb1 and Vox1-O), the training process, and the evaluation metrics (EER). However, there is a lack of publicly available code or datasets, which may hinder reproducibility. Clear instructions for replicating the experiments would enhance the paper's impact.
One limitation is the reliance on pre-trained speech enhancement models, which may not be universally applicable across all domains or languages. Additionally, while the proposed method shows improvements in noisy conditions, it may still struggle in extreme noise scenarios or with overlapping speakers. The paper does not address potential computational costs associated with the proposed methods, which could affect real-time applications.
The proposed framework has significant implications for real-world applications in speaker recognition systems, particularly in environments with background noise, such as call centers, security systems, and personal assistants. By improving the robustness of speaker recognition, this research could enhance user experience and accessibility in various audio processing applications.
Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
Primary: National Taiwan University
All Institutions: National Taiwan University, Artificial Intelligence Center of Research Excellence (AI-CoRE)
This paper makes a significant contribution by systematically evaluating uncertainty estimation methods in audio-aware large language models, revealing critical insights that could guide future research and applications in multimodal AI systems. The comprehensive benchmarking and analysis of methods provide a valuable foundation for improving the reliability of ALLMs in practical scenarios.
The paper presents a systematic empirical study of uncertainty estimation methods tailored for audio-aware large language models (ALLMs). It benchmarks five distinct methods, including predictive entropy and semantic entropy, across various models and tasks, highlighting the unique challenges posed by audio inputs. The methodology is sound, employing a two-stage protocol for uncertainty estimation and a clear comparative analysis across multiple benchmarks. However, the reliance on existing methods from text-based LLMs without significant adaptation for audio-specific challenges could be seen as a limitation.
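Two of the benchmarked baselines are easy to state from sampled answers and their token log-probabilities. A sketch of the usual Monte-Carlo estimators, with the data layout assumed:

```python
import numpy as np

def predictive_entropy(seq_logprobs):
    """Monte-Carlo predictive entropy over N sampled answers:
    the negative mean of total sequence log-probabilities."""
    return -float(np.mean(seq_logprobs))

def length_normalized_entropy(token_logprobs_per_seq):
    """Same idea, but each sequence's log-probability is first divided
    by its token count, so longer answers are not penalised."""
    per_seq = [np.sum(lp) / len(lp) for lp in token_logprobs_per_seq]
    return -float(np.mean(per_seq))

# Toy usage with three sampled answers of different lengths.
samples = [np.array([-0.1, -0.2]), np.array([-0.3]), np.array([-0.05, -0.1, -0.2])]
print(length_normalized_entropy(samples))
```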
The experiments are comprehensive, covering a wide range of benchmarks that assess both general audio understanding and trustworthiness-oriented tasks. The results indicate that semantic-level and verification-based methods consistently outperform token-level baselines, providing valuable insights into the performance of uncertainty estimation in ALLMs. The evaluation metrics, including AUROC and AURAC, are appropriate for the tasks at hand.
While the paper provides a detailed description of the experimental setup, including the models used and the evaluation protocols, it lacks specific implementation details or code availability, which could hinder reproducibility. The absence of a project URL further complicates this aspect.
The study primarily focuses on constrained answer spaces, which may not generalize well to open-ended tasks. Additionally, the uncertainty estimation methods are largely inherited from text LLM literature, potentially limiting their effectiveness in capturing audio-specific uncertainties. The fixed threshold for adaptive inference may not be optimal across all scenarios, and the study does not explore more sophisticated routing strategies.
The findings have significant implications for the development of more reliable audio-language systems, particularly in applications requiring robust uncertainty estimation for decision-making. The work lays a foundation for future research in uncertainty-aware models, which could enhance the safety and reliability of AI systems in high-stakes environments.
To establish empathy with machines, it is essential to fully understand human emotional changes. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, meaning that different people may express the same emotion differently. We see this in daily life: some people express "happiness" through their facial expressions and words, while others may hide it or express it through their actions. Both are expressions of "happiness," yet such differences in emotional expression remain difficult for machines to distinguish. Current emotion recognition stays at a "static" level, using a single recognition model for all emotional styles, and this simplification often degrades recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker Adaptive Network (ML-SAN), which effectively addresses the challenge of speaker identity information confusion. ML-SAN does not simply assign a speaker's ID after recognition; instead, it employs a three-stage adaptive process. First, Input-level Calibration uses Feature-wise Linear Modulation (FiLM) to project the raw audio and visual features into a speaker-neutral space. Then, Interaction-level Gating re-adjusts the trust placed in each modality (e.g., voice or facial features) based on the speaker's identity information. Finally, Output-level Regularization maintains the consistency of speaker features in the latent space. Tests on the MELD and IEMOCAP datasets show that ML-SAN achieves better results, performs especially well on challenging tail sentiment categories, and better handles the diversity of speakers in real-world scenarios.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The main contribution of this paper is the introduction of the Multi-Level Speaker-Adaptive Network (ML-SAN), which effectively addresses speaker heterogeneity in multimodal emotion recognition through a novel three-stage adaptive process. This work represents a significant advancement in the field of emotion recognition by integrating speaker identity into the modeling process, thereby improving the accuracy and robustness of emotion detection in conversations.
The proposed ML-SAN framework introduces a three-stage adaptive process that effectively addresses the challenges of speaker identity confusion in emotion recognition. The use of Feature-wise Linear Modulation (FiLM) for input calibration, dynamic gating for interaction-level adjustments, and output regularization to maintain speaker identity showcases a thoughtful and innovative approach to handling multimodal data. This hierarchical adaptation strategy is a significant advancement over traditional speaker-agnostic methods, as it actively incorporates speaker characteristics into the model's decision-making process.
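The input-level calibration is a standard FiLM layer conditioned on a speaker embedding; a sketch with placeholder dimensions:

```python
import torch
import torch.nn as nn

class SpeakerFiLM(nn.Module):
    """Feature-wise Linear Modulation conditioned on a speaker embedding:
    y = gamma(s) * x + beta(s). In ML-SAN this corresponds to the
    input-level calibration step; the dimensions here are placeholders."""
    def __init__(self, feat_dim=128, spk_dim=64):
        super().__init__()
        self.to_gamma = nn.Linear(spk_dim, feat_dim)
        self.to_beta = nn.Linear(spk_dim, feat_dim)

    def forward(self, x, spk_emb):
        # x: (batch, time, feat_dim); spk_emb: (batch, spk_dim)
        gamma = self.to_gamma(spk_emb).unsqueeze(1)  # broadcast over time
        beta = self.to_beta(spk_emb).unsqueeze(1)
        return gamma * x + beta
```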
The experiments conducted on the MELD and IEMOCAP datasets demonstrate the effectiveness of the ML-SAN model, achieving superior performance compared to the baseline MultiEMO. The rigorous evaluation, including ablation studies to analyze the contribution of each component, adds credibility to the findings. The reported metrics, such as the weighted F1-score, indicate that the model performs well, particularly in challenging scenarios involving diverse emotional expressions.
The paper provides sufficient details regarding the experimental setup, including the use of specific datasets and the implementation of baseline models under identical conditions. However, the absence of a publicly accessible code repository limits the reproducibility of the results. Future work should consider making the code available to facilitate further research and validation.
While the ML-SAN model shows promising results, the paper acknowledges potential challenges in real-world applications, such as background noise and missing modalities. Additionally, the model's reliance on specific datasets may limit its generalizability to other contexts or languages. The authors should address these limitations in future iterations of their work.
The ability to accurately recognize emotions in conversations has significant implications for the development of empathetic AI systems. This research could enhance human-computer interaction in various applications, including virtual assistants, mental health support, and customer service. By improving emotion recognition, ML-SAN can contribute to more nuanced and effective communication between humans and machines.
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents a novel approach to adapting a frozen multilingual TTS model for Indic languages, demonstrating competitive performance against commercial systems while requiring minimal training data. The combination of BUPS, LoRA adaptation, and voice-prompt recovery represents a significant advancement in TTS technology, particularly for low-resource languages.
The methodology presented in the paper combines three innovative components: the Brahmic Unified Phoneme Space (BUPS) for romanisation of Indic scripts, a low-rank adaptation (LoRA) approach for the text-token predictor, and a voice-prompt recovery recipe that enhances acoustic output without retraining the acoustic decoder. This combination allows for effective adaptation of a frozen multilingual TTS model to support Indic languages, which is a significant advancement in TTS technology for low-resource languages. The approach is well-structured, addressing specific challenges in TTS for Indic languages and demonstrating a clear understanding of the limitations of existing systems.
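The BUPS idea can be illustrated with a toy character table; the real mapping covers seven scripts with vowel signs and conjuncts, so the three entries below are only indicative:

```python
# Toy fragment of a Brahmic-to-ISO-15919 romanisation table; the real
# BUPS mapping covers seven scripts plus vowel signs and conjuncts.
ISO_15919 = {
    "ट": "ṭa",  # Devanagari retroflex ta
    "त": "ta",  # Devanagari dental ta
    "ழ": "ḻa",  # Tamil retroflex approximant (the "zha" letter)
}

def romanise(text: str) -> str:
    """Deterministically map covered characters to ISO-15919 Latin,
    passing everything else through unchanged so the output stays
    consumable by a Latin tokenizer."""
    return "".join(ISO_15919.get(ch, ch) for ch in text)

print(romanise("ழ"))  # -> ḻa
```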
The experimental evaluation is robust, utilizing a companion benchmark for assessing phonological accuracy and intelligibility across three Indic languages. The results indicate that the proposed system performs competitively against commercial baselines, particularly in terms of retroflex collapse and other phonological metrics. The use of a 10-utterance pilot set allows for initial validation, although the small sample size may limit statistical significance. The paper effectively communicates the results, providing detailed comparisons with existing systems.
The authors have made significant efforts to ensure reproducibility by releasing the LoRA weights, inference code, and a demo interface. However, the reliance on specific datasets and the complexity of the methods may pose challenges for complete replication without access to the same resources. The paper includes sufficient detail on the methodology and experimental setup to allow for independent verification of results.
The paper acknowledges several limitations, including the small sample size for pilot evaluations, the lack of formal Mean Opinion Score (MOS) testing, and the challenges faced in adapting the acoustic decoder. Additionally, the performance on Hindi with the LoRA adapter regressed accuracy, indicating that the method's effectiveness may vary across languages. The authors also note that the current implementation relies on reference audio clips, which may limit flexibility in practical applications.
This research has the potential to significantly impact the development of TTS systems for low-resource languages, particularly in India, where many languages are underrepresented in commercial TTS solutions. By providing a method that requires minimal training data and computational resources, the work could democratize access to high-quality TTS technology for Indic languages, fostering greater inclusivity in technology. The open-source release of the model and code further enhances its potential for widespread adoption and further research.
This performance presents a duet between two intelligent musical instruments, Sù (to trace back; to go upstream) and Agentier (playing on agentic clavier), and their human performers, connected through feedback loops. Rather than treating AI as a tool that responds predictably to input, both systems operate recursively, where past actions continuously influence future behaviour. The Sù operates in the audio space through latent representation. Its performer uses Make Noise 0-series synthesisers and MIDI controllers to work with a neural feedback synthesis system based on a RAVE model, with a latent feedback loop embedded within the model's internal structure. This allows the instrument to remember and reuse its own internal states, influencing ongoing sound generation through its recent sonic history. The Agentier functions in the control space. Its performer interacts with the system using a Roland S-1 synthesiser and Keith McMillen QuNeo touchpad, where control gestures are routed into a recurrent neural network that feeds back into the synthesis process. Through this feedback loop, the system actively shapes the evolution of control signals over time. Contrasting feedback in the audio and control domains, the performance explores shared agency, resistance, and negotiation between humans and intelligent musical systems. Musical phenomena are co-produced through the entangled states of interaction, rather than through pre-existing system configuration or fixed mappings.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a significant contribution to the field of AI in music by exploring the co-constructive relationship between human performers and intelligent musical instruments through innovative feedback mechanisms. The methodology is well-defined, though the lack of rigorous experimental evaluation and reproducibility details limits its impact.
The paper presents a novel approach to musical performance through the integration of AI in two intelligent musical instruments, Sù and Agentier. The methodology is well-articulated, detailing the use of a RAVE model for audio synthesis and a recurrent neural network for control signal generation. The recursive feedback mechanisms employed in both instruments are innovative, allowing for a dynamic interaction between the performer and the instrument, which enhances the creative process. The use of latent representations and direct manipulation of latent dimensions is particularly noteworthy, as it provides performers with greater control over the sonic output.
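To make the latent feedback idea concrete, here is a minimal sketch of a feedback loop wrapped around a streaming RAVE export (which exposes encode/decode methods); the file path, block shape, and blend coefficient are assumptions rather than details from the paper.

```python
import torch

# Assumes a streaming RAVE torchscript export exposing encode()/decode();
# the path and blend coefficient below are illustrative, not from the paper.
model = torch.jit.load("rave_export.ts")

alpha = 0.6        # balance between fresh input and remembered latent history (assumed)
z_prev = None      # the instrument's "memory" of its own internal states

def process_block(audio_block: torch.Tensor) -> torch.Tensor:
    """audio_block: (1, 1, n_samples). Returns resynthesized audio of the same shape."""
    global z_prev
    z = model.encode(audio_block)              # audio -> latent frames
    if z_prev is not None:
        z = alpha * z + (1 - alpha) * z_prev   # latent feedback inside the model's own space
    z_prev = z.detach()
    return model.decode(z)                     # latent frames -> audio
```

Because the feedback happens in latent space rather than on the raw signal, the instrument's recent sonic history colours new input without simply re-amplifying it.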
While the paper describes the performance setup and the interaction between the instruments, it lacks a comprehensive experimental evaluation with quantitative metrics. The authors mention a video documentation of a performance, which serves as a qualitative demonstration of their approach. However, there is no detailed analysis of the performance outcomes, such as audience reception or systematic comparisons with traditional instruments or other AI-enabled systems. Including listener-based metrics such as Mean Opinion Score (MOS) or other structured evaluations would strengthen the claims made.
The paper provides a clear description of the instruments and the technology used, which aids in reproducibility. However, specific implementation details, such as the exact configurations of the neural networks and the training datasets, are not sufficiently detailed. Additionally, the lack of a publicly available code repository limits the ability of other researchers to replicate the work fully.
One of the main limitations is the absence of a rigorous experimental evaluation framework to assess the performance of the instruments quantitatively. The reliance on qualitative descriptions and a single performance video may not provide a comprehensive understanding of the instruments' capabilities. Furthermore, the paper does not address potential issues related to latency in real-time performance, which could affect the interaction quality between the performer and the AI systems.
The integration of AI in musical performance has significant implications for the future of music creation and performance. This work encourages a rethinking of the role of the performer and the instrument, promoting a collaborative relationship that could lead to new forms of musical expression. The exploration of feedback loops and shared agency could inspire further research in both music technology and human-computer interaction, potentially influencing the design of future intelligent musical instruments.
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
Primary: National Taiwan University
All Institutions: National Taiwan University, NTU Artificial Intelligence Center of Research Excellence
The paper presents a critical analysis of the reliance on audio in audio-language models, challenging existing benchmarks and proposing a framework for better evaluation. The methodology and findings are significant, offering valuable insights for researchers and practitioners in the field of machine learning and audio understanding.
The paper introduces a novel diagnostic framework that assesses large audio-language models (LALMs) based on two axes: text prior and audio reliance. This dual-axis approach allows for a nuanced understanding of how much of a model's performance can be attributed to textual cues versus actual audio processing. The methodology is well-structured, employing controlled settings to quantify the text prior and audio reliance, which is a significant advancement in evaluating LALMs. The use of multiple benchmarks and a variety of models strengthens the robustness of the findings.
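To illustrate the two axes, a hedged sketch of how they could be operationalized is shown below; `model.answer` and the item fields are placeholders, and the paper's exact scoring functions may differ.

```python
def text_prior(model, items):
    """Share of questions answered correctly with the audio withheld entirely."""
    hits = sum(model.answer(it.question, audio=None) == it.gold for it in items)
    return hits / len(items)

def audio_reliance(model, items):
    """Normalized gain from restoring the audio over the text-only score:
    how much of the remaining headroom the acoustic signal actually closes."""
    full = sum(model.answer(it.question, audio=it.audio) == it.gold for it in items)
    text_only = sum(model.answer(it.question, audio=None) == it.gold for it in items)
    return (full - text_only) / max(len(items) - text_only, 1)
```

Under this framing, the reported 60-72% retention without audio corresponds to a high text prior, which is exactly what makes raw benchmark accuracy a weak proxy for auditory understanding.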
The experiments are thorough, evaluating eight LALMs across three distinct benchmarks. The results indicate a substantial grounding gap, revealing that models can achieve high scores without audio input, which challenges the assumption of robust auditory understanding. The analysis of performance retention with partial audio is particularly insightful, providing a clear picture of how audio information is utilized by the models. However, the paper could benefit from more detailed statistical analysis to support its claims.
The paper provides a clear description of the experimental setup, including the models used and the evaluation protocols. However, it lacks specific URLs or repositories for code and data, which could hinder reproducibility. Including such resources would enhance the paper's impact and facilitate further research in this area.
One limitation is the reliance on existing benchmarks, which may not fully capture the complexities of audio understanding. Additionally, while the study identifies issues with current benchmarks, it does not propose new benchmarks or datasets, which could be a missed opportunity for advancing the field. The findings may also be limited by the specific models and benchmarks chosen for evaluation.
The findings have significant implications for the design of future audio-language benchmarks and the evaluation of LALMs. By highlighting the potential for models to rely on textual priors rather than genuine auditory understanding, the paper calls for a reevaluation of how auditory capabilities are assessed in machine learning. This could lead to more accurate and reliable evaluations, ultimately improving the development of models that genuinely understand audio.
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing
The paper presents a significant advancement in automatic speech recognition by introducing an abstention-aware framework and a novel reliability metric, RAS, which enhances the reliability of ASR outputs in uncertain conditions. The methodology is well-founded and the experimental results robustly support the proposed contributions, marking a meaningful step forward in the field of speech processing.
The paper introduces a novel abstention-aware transcription framework for ASR systems, which allows models to abstain from uncertain segments rather than producing potentially misleading transcriptions. The proposed Reliability-Aware Score (RAS) metric is innovative in that it scores transcriptions containing explicit abstention placeholders for uncertain segments, moving beyond accuracy-only metrics like Word Error Rate (WER). The methodology is well-structured, employing a two-stage training pipeline that combines supervised bootstrapping and reinforcement learning, effectively enhancing the model's reliability in challenging acoustic conditions.
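The digest does not reproduce the RAS formula, so the following is only a plausible shape for an abstention-aware score: correct tokens earn credit, abstentions are neutral, and errors are penalized by a trade-off weight standing in for the human-preference-calibrated parameter.

```python
def ras_like_score(hyp_tokens, ref_tokens, lam: float = 2.0, abstain: str = "<abstain>"):
    """
    Toy reliability score, not the paper's RAS: +1 per correct token,
    0 per abstained token, -lam per erroneous token, normalized by length.
    Assumes the hypothesis and reference are already aligned token-by-token.
    """
    score = 0.0
    for hyp, ref in zip(hyp_tokens, ref_tokens):
        if hyp == abstain:
            continue                 # abstention: uninformative but not misleading
        score += 1.0 if hyp == ref else -lam
    return score / max(len(ref_tokens), 1)
```

The key property any such metric needs is that a well-placed abstention scores better than a confident error, which is what WER alone cannot express.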
The experiments are comprehensive, utilizing two datasets (LibriSpeech and TALCS) to evaluate the proposed method under both clean and noisy conditions. The results demonstrate significant improvements in transcription reliability, particularly in adverse environments, while maintaining competitive accuracy. The use of human preference alignment for calibrating the RAS metric adds robustness to the evaluation process, ensuring that the proposed framework is grounded in real-world applicability.
The paper provides detailed descriptions of the methodology, including the training pipeline and experimental setup. However, there is a lack of supplementary material or code repositories that would facilitate complete reproducibility. The absence of a project URL limits the ability for other researchers to replicate the findings directly.
While the proposed framework shows promise, the reliance on human preference data for calibrating the RAS metric may introduce biases based on the specific population sampled. Additionally, the performance in highly diverse acoustic environments beyond those tested (e.g., different languages or dialects) remains unaddressed, which could limit the generalizability of the findings.
The approach has significant implications for high-stakes applications of ASR, such as medical and legal transcription, where reliability is critical. By providing a mechanism for models to indicate uncertainty, the framework can enhance user trust and improve decision-making processes in various domains. The introduction of RAS as a new evaluation metric could also pave the way for further research into reliable ASR systems.
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN Audio A/S
The main contribution of this paper is the introduction of DriftSE, a novel generative framework for speech enhancement that reformulates denoising as an equilibrium problem, achieving high-fidelity results in a single inference step. This work represents a significant advancement in the field of speech enhancement, combining innovative methodology with robust experimental validation to address critical challenges in real-time applications.
The proposed method, DriftSE, innovatively formulates speech enhancement as an equilibrium problem, leveraging a learned Drifting Field for one-step inference. This approach diverges from traditional iterative sampling techniques, providing a significant computational advantage. The use of a semantic latent space for drift computation enhances the model's ability to capture complex speech structures, which is a notable improvement over existing methods. The dual formulation of the model—direct mapping and conditional generation—adds flexibility and robustness to the framework, allowing it to adapt to various scenarios, including unpaired training.
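As a schematic of the one-step training signal (not the authors' code), the sketch below regresses the mapper's output onto a drift-corrected target; `mapper` and `drift_field` are placeholder networks, and fitting the drift field itself is omitted.

```python
import torch

def drift_training_step(mapper, drift_field, noisy, optimizer):
    """
    Schematic only: nudge the one-step mapper's output distribution toward
    the clean-speech distribution via a drift-corrected regression target.
    The drift field is assumed pre-fit to point toward high-density clean
    regions (its own training is omitted here).
    """
    x = mapper(noisy)                      # one-step estimate of clean speech
    with torch.no_grad():
        target = x + drift_field(x)        # correction vector toward the clean distribution
    loss = torch.mean((x - target) ** 2)   # distribution-level pull; no paired reference needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target is built from the drift field rather than a paired clean utterance, this shape of objective is compatible with the unpaired-training property the review highlights.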
The experiments conducted on the VoiceBank-DEMAND benchmark and the DNS Challenge 2020 blind test set showcase the effectiveness of DriftSE in achieving high-fidelity speech enhancement. The reported metrics (PESQ, SI-SDR, SCOREQ) indicate that DriftSE outperforms both multi-step diffusion models and other one-step approaches, establishing its competitive edge. The thorough evaluation across different datasets and conditions demonstrates the model's generalization capabilities, which is crucial for real-world applications.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and hyperparameter settings, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the accessibility of the method for further validation by the research community.
While the DriftSE framework shows promising results, its reliance on a pre-trained self-supervised learning encoder may introduce limitations related to the quality and representativeness of the latent features. Additionally, the performance drop in unpaired settings suggests that the model may struggle in scenarios where clean-reference data is not available, highlighting a potential area for improvement.
The DriftSE framework has significant implications for real-time speech enhancement applications, particularly in environments with varying noise conditions. Its ability to perform one-step inference could facilitate deployment in low-latency scenarios, such as telecommunication and assistive technologies. Furthermore, the methodology could inspire future research in generative modeling and distribution matching across other domains beyond audio.
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences, a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framework that decomposes this complex task through specialized multi-agent orchestration. Our framework employs three key innovations: (1) a multi-agent narrative synthesis module where specialized LLM agents collaboratively generate comprehensive cinematic blueprints with character profiles, scene descriptions, and cross-modal specifications; (2) a decoupled character-centric pipeline that maintains identity consistency through instance-level tracking and integration while enabling flexible multi-character composition; and (3) a hierarchical audio-visual synchronization mechanism ensuring frame-level alignment of dialogue, expressions, and music. Extensive experiments demonstrate that CineAGI achieves a 40% improvement in overall consistency, a 4.4% gain in subject consistency, a 5.4% enhancement in aesthetic quality, and 28.7% higher character consistency compared to baselines. Our work establishes a principled foundation for automated multi-scene video generation that preserves narrative coherence and character authenticity.
Primary: Nanjing University
All Institutions: Nanjing University, Zhejiang Sci-Tech University, University of British Columbia, Beijing Shuzhimei Technology Co., Ltd, Jilin University, Tianjin University
CineAGI represents a significant advancement in automated movie creation through its innovative multi-agent orchestration framework. The comprehensive methodology and substantial experimental validation establish it as a leading approach in the field, with the potential to reshape how narratives are crafted in digital media.
The methodology presented in CineAGI is robust and innovative, leveraging a hierarchical multi-agent orchestration approach to tackle the complex task of automated movie creation. The use of specialized LLM agents for narrative synthesis, character generation, and cinematographic synthesis is a significant advancement over traditional end-to-end models. The framework's ability to maintain character consistency and narrative coherence across scenes through decoupled processing and explicit synchronization mechanisms is particularly noteworthy. The detailed breakdown of each module and the integration of various generative models demonstrate a comprehensive understanding of the challenges in automated filmmaking.
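A toy sketch of the narrative-synthesis stage is shown below; the agent roles, prompts, and blueprint schema are guesses at the described structure, with `llm` standing in for any chat-completion call.

```python
def make_blueprint(llm, story_prompt: str) -> dict:
    """Illustrative multi-agent narrative synthesis: each specialized agent
    contributes one slice of the cinematic blueprint (roles/schema assumed)."""
    scenes = llm("Screenwriter agent: break this story into numbered scenes "
                 f"with settings and narrative beats:\n{story_prompt}")
    characters = llm("Casting agent: list each character with a stable visual "
                     f"and voice profile for:\n{story_prompt}")
    av_spec = llm("Synchronization agent: for each scene, specify dialogue "
                  f"timing, expression cues, and music:\n{scenes}")
    return {
        "scenes": scenes,             # scene descriptions
        "characters": characters,     # identity profiles reused across scenes
        "audio_visual_spec": av_spec, # frame-level cross-modal alignment spec
    }
```

The point of the decomposition is that downstream generators consume a shared, explicit blueprint, which is what lets the pipeline enforce character identity and audio-visual alignment across scenes.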
The experimental evaluation is thorough, utilizing a diverse benchmark of 100 story prompts across multiple genres to assess the framework's performance. The use of both quantitative metrics and qualitative human evaluations provides a well-rounded perspective on the system's effectiveness. The reported improvements in consistency and aesthetic quality are substantial, indicating that the proposed methods yield significant enhancements over existing baselines. However, the paper could benefit from more detailed comparisons with a wider range of contemporary methods to further contextualize its contributions.
The paper provides a detailed description of the experimental setup, including generation settings, evaluation metrics, and baseline comparisons. However, the lack of publicly available code or demo URLs limits reproducibility. Future work should consider releasing the implementation to facilitate further research and validation by the community.
One limitation of the study is the reliance on specific generative models, which may not generalize across all contexts or genres of filmmaking. Additionally, while the framework shows improvements in character consistency and narrative coherence, the complexity of the system may introduce challenges in real-time applications or scalability. The computational cost of approximately 11.3 minutes per scene on a single GPU could also be a barrier for broader adoption.
The implications of CineAGI extend beyond academic research into practical applications in the film and entertainment industry. By automating aspects of movie creation, this framework could democratize content production, enabling creators with limited resources to produce high-quality narratives. Furthermore, the integration of AI in creative processes raises questions about authorship and the role of human creativity in storytelling.
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, and these heads behave markedly differently on semantic versus acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining only 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, DAIL Tech, Northeastern University, Sichuan University, Huazhong University of Science and Technology
The main contribution of this paper is the introduction of HeadRouter, a dynamic head-weight routing mechanism for audio token pruning in large audio language models, which significantly enhances performance and efficiency in processing diverse audio tasks. This work represents a meaningful advancement in the field of audio language models, addressing critical challenges in token management and model efficiency while maintaining high performance across various audio tasks.
The proposed HeadRouter method introduces a novel dynamic head-weight routing mechanism that adapts to the varying importance of attention heads in large audio language models (LALMs). This approach is innovative in its use of entropy-based selectivity scores and Gaussian soft mixing to create task-specific head-weight profiles. The training-free nature of the method allows it to be easily integrated into existing models without additional training overhead, which is a significant advantage for practical applications.
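A simplified sketch of head-importance-aware pruning follows; it replaces the paper's Gaussian soft mixing of task-specific profiles with a plain softmax over negative attention entropy, so it should be read as an approximation of the idea rather than the method itself.

```python
import torch

def prune_audio_tokens(attn: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """
    attn: (heads, queries, audio_tokens) attention weights, assumed renormalized
    over the audio span. Selective (low-entropy) heads get larger weight,
    replacing the uniform head average used by prior pruning methods.
    """
    probs = attn.mean(dim=1)                             # (heads, audio_tokens), avg over queries
    ent = -(probs * (probs + 1e-9).log()).sum(dim=-1)    # per-head attention entropy
    head_w = torch.softmax(-ent, dim=0)                  # selective heads dominate the vote
    token_scores = (head_w[:, None] * probs).sum(dim=0)  # head-importance-weighted token scores
    k = max(1, int(keep_ratio * probs.shape[-1]))
    return token_scores.topk(k).indices.sort().values    # keep top-k tokens, original order
```

Weighting heads this way is what lets the method stay training-free: the importance profile is read off the model's own attention statistics rather than learned.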
The experiments conducted on the AudioMarathon and MMAU-Pro benchmarks demonstrate the effectiveness of HeadRouter in outperforming existing token pruning methods across various audio tasks. The results indicate that the method not only maintains performance while aggressively pruning tokens but also adapts well to different audio contexts, showcasing its robustness. The comparative analysis with state-of-the-art methods further validates the proposed approach's superiority in managing token importance dynamically.
The paper provides a clear description of the methodology, including the routing mechanism and evaluation setup, which supports reproducibility. However, the lack of publicly available code or detailed implementation guidelines may hinder full reproducibility for other researchers.
One limitation is the reliance on pre-calibrated head-weight profiles, which may not generalize across all audio tasks or models. Additionally, while the method shows promise in reducing computational costs, the paper does not explore the implications of using HeadRouter in real-time applications or its impact on latency in practical deployments.
The implications of this research extend to various applications in audio processing, including speech recognition, music analysis, and multimodal systems. By improving the efficiency of LALMs, this work could facilitate more widespread adoption of advanced audio understanding technologies in real-time applications, enhancing user experiences in voice-interactive systems.
Machine generation of symbolic music and digital audio are hot topics, but relatively few digital musical instruments integrate generative AI. Present musical AI tools are not artist-centred and do not support experimentation or integration into musical instruments and practices. This work introduces an inexpensive generative AI instrument platform based on a single-board computer that connects via MIDI to other musical devices. The platform uses artist-collected datasets with models trained on a regular computer. This paper asks what the design space of intelligent musical instruments might look like when accessible and portable AI systems are available for artistic exploration. I contribute five examples of instruments created and tested through a two-year first-person artistic research process. These show that (re)mapping can replace retraining for discovering AI interaction, that fast input interleaving is a new co-creative strategy, that small-data AI models can be a transportable design resource, and that cheap hardware can lower barriers to inclusion. This work could enable artists to explore new interaction and performance schemes with intelligent musical instruments.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a novel generative AI platform for intelligent musical instruments, emphasizing artist-centered design and small-data approaches. The comprehensive exploration of performance experiences and instrument development contributes valuable insights to the intersection of AI and music, highlighting the potential for innovative co-creative practices.
The methodology is grounded in a first-person artistic research approach, which is innovative in the context of generative AI in music. The use of small-data AI models trained on artist-collected datasets is a significant contribution, allowing for a more personalized and artist-centered exploration of generative AI in musical contexts. The paper effectively outlines the design and implementation of a generative AI platform that integrates with existing musical instruments, showcasing a practical application of AI in music performance. The iterative development of five distinct instruments provides a rich qualitative dataset for analysis.
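To ground the "(re)mapping can replace retraining" claim, here is a minimal sketch of rerouting MIDI controls to different model inputs at performance time while the trained model stays fixed; the mapping tables and parameter names are invented for illustration.

```python
# Remapping as a design move: change which controller drives which model input,
# rather than retraining the model. The mapping table itself becomes the
# artistic material. Parameter names and CC assignments are hypothetical.
MAPPINGS = {
    "dense":  {1: "pitch_offset", 2: "temperature", 3: "density"},
    "sparse": {1: "temperature", 2: "density", 3: "pitch_offset"},
}

def handle_cc(model_params: dict, mapping_name: str, cc_number: int, cc_value: int):
    """Route an incoming MIDI CC (0-127) to a model parameter under the active mapping."""
    target = MAPPINGS[mapping_name].get(cc_number)
    if target is not None:
        model_params[target] = cc_value / 127.0  # normalise to the model's expected range
```

Swapping `mapping_name` mid-performance changes the instrument's feel instantly, with no new training run, which is exactly the kind of low-cost exploration the platform is built for.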
The experiments conducted over two years of performance practice are well-documented, providing insights into the evolution of the instruments and their interactions with musicians. The author details the performance experiences and the adaptability of the instruments in various contexts, which adds depth to the evaluation. However, the paper lacks quantitative metrics for assessing the performance of the AI models, which could strengthen the evaluation of their effectiveness.
The implementation details are provided, including the use of Raspberry Pi and the open-source nature of the software, which enhances reproducibility. The availability of the project on GitHub allows others to replicate the setup and experiment with the platform. However, more detailed instructions on the configuration and training processes would further aid reproducibility.
The study is limited by its first-person perspective, which may not capture the full range of experiences from diverse musicians. Additionally, the exploration of model updates over time is not systematically addressed, which could provide further insights into the adaptability and longevity of the AI models in performance settings.
This work has the potential to democratize access to intelligent musical instruments by lowering the cost barrier and encouraging experimentation among artists. The findings could influence future designs of musical AI systems, promoting a shift towards artist-centered approaches in generative AI applications. The implications for HCI and music technology communities are significant, as the research opens new avenues for interaction and collaboration between humans and AI in creative practices.
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that compels models to learn platform-invariant semantic structural representations. In this paper, the RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set further includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross-platform generalization and noise robustness, offering an effective and generalizable modeling paradigm. The RTCFake dataset is available at https://huggingface.co/datasets/JunXueTech/RTCFake.
Primary: unknown
All Institutions: unknown
The paper presents RTCFake, a novel dataset and a phoneme-guided consistency learning strategy for detecting speech deepfakes in real-time communication, addressing a critical gap in existing research. The methodology is innovative, and the experimental results demonstrate substantial improvements, making it a valuable contribution to the field of audio and speech processing.
The paper introduces a phoneme-guided consistency learning (PCL) strategy, which is a novel approach aimed at enhancing the robustness of speech deepfake detection in real-time communication scenarios. The proposed methodology effectively addresses the challenges posed by various distortions and codec compressions encountered in RTC environments. The dataset, RTCFake, is a significant contribution, as it is specifically designed for the complexities of real-time communication, which is often overlooked in existing literature.
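The PCL objective is only named in this digest; a hedged sketch of its likely shape is a detection loss plus a phoneme-level consistency term over the offline/online pairs, where the encoder interface, alignment, and weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

def pcl_loss(encoder, offline_wav, online_wav, label, segs, lam: float = 1.0):
    """
    Detection loss + phoneme-level consistency between a paired offline/online
    utterance. `segs` is a list of (start, end) frame spans per phoneme
    (alignment assumed given); `label` is a (1,) long tensor (real/fake).
    """
    h_off, logits = encoder(offline_wav)          # (frames, dim), (classes,)
    h_on, _ = encoder(online_wav)
    det = F.cross_entropy(logits.unsqueeze(0), label)
    cons = torch.stack([
        1.0 - F.cosine_similarity(h_off[s:e].mean(0), h_on[s:e].mean(0), dim=0)
        for s, e in segs                          # pull phoneme reps together across platforms
    ]).mean()
    return det + lam * cons
```

The precise offline/online pairing the dataset provides is what makes such a consistency term trainable at all, since the same utterance is observed before and after each platform's processing chain.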
The authors provide a comprehensive evaluation of their proposed method using a large-scale dataset of approximately 600 hours of speech. The evaluation set includes both unseen RTC platforms and complex noise conditions, which enhances the realism of the testing environment. The reported improvements in cross-platform generalization and noise robustness are significant, indicating that the proposed method is effective in practical applications.
While the paper mentions the availability of the RTCFake dataset on Hugging Face, it lacks detailed implementation specifics regarding the PCL strategy and the models used. This omission could hinder reproducibility, as other researchers may struggle to replicate the results without clear guidance on the experimental setup.
One limitation is that the dataset may not encompass all possible real-time communication scenarios, potentially limiting the generalizability of the findings. Additionally, the paper does not address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant, as it addresses a pressing issue in the age of deepfake technology. The ability to detect speech deepfakes in real-time communication can have far-reaching effects on security, privacy, and trust in digital communications. The proposed dataset and methodology could serve as a foundation for future research in this area.