Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with supervised fine-tuning baselines. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
Primary: Eastern Institute of Technology
All Institutions: Eastern Institute of Technology, Shanghai Jiao Tong University, The Hong Kong Polytechnic University, Southeast University, Xi'an Jiaotong-Liverpool University
The main contribution of this paper is the introduction of AdaSR, an innovative framework for adaptive streaming reasoning that optimizes the reasoning process in dynamic environments through hierarchical policy optimization and adaptive rewards. This work represents a significant advancement in the field of machine learning, particularly in the context of real-time reasoning and decision-making under uncertainty.
The paper introduces AdaSR, a novel adaptive streaming reasoning framework that leverages reinforcement learning to optimize reasoning during dynamic input streams. The methodology is robust, incorporating Hierarchical Relative Policy Optimization (HRPO) to address the temporal credit assignment problem inherent in streaming reasoning. By decomposing the policy optimization into distinct phases, AdaSR allows for more nuanced advantage assignment, which is a significant improvement over traditional methods that apply uniform advantages across all tokens. The integration of adaptive rewards further enhances the model's ability to balance reasoning accuracy and computational efficiency.
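The phase-wise advantage assignment described above can be sketched as follows. This is a minimal illustration, assuming GRPO-style group normalization (reward minus group mean, divided by group standard deviation) applied separately to each phase; the split into a streaming reward and a deep reward, the function names, and all numbers are hypothetical, and the paper's exact formulation may differ.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """Normalize rewards within a group of sampled rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def phase_wise_advantages(stream_rewards, deep_rewards, stream_lens, deep_lens):
    """Return per-token advantages for each rollout, one value per phase.

    Tokens produced while the input is streaming receive the streaming-phase
    advantage; tokens produced after the stream ends receive the deep-phase
    advantage. A single sequence-level scheme would use one value for both.
    """
    a_stream = group_relative(stream_rewards)
    a_deep = group_relative(deep_rewards)
    return [
        [a_s] * n_s + [a_d] * n_d
        for a_s, a_d, n_s, n_d in zip(a_stream, a_deep, stream_lens, deep_lens)
    ]

# Hypothetical group of three rollouts sampled for one prompt.
advantages = phase_wise_advantages(
    stream_rewards=[0.2, 0.8, 0.5],
    deep_rewards=[1.0, 0.0, 1.0],
    stream_lens=[3, 2, 4],
    deep_lens=[2, 3, 1],
)
```

In this toy group the first rollout is penalized for its streaming tokens but rewarded for its deep-reasoning tokens, which a single sequence-level advantage could not express.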
The experiments are comprehensive, evaluating AdaSR against multiple benchmarks in reasoning tasks, including mathematical reasoning and context-based question answering. The results demonstrate significant improvements in accuracy and efficiency compared to baseline models, indicating the effectiveness of the proposed approach. The paper provides detailed metrics on accuracy, token lengths, and latency, which are critical for assessing the performance of streaming reasoning models.
The authors have released their code, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics that would facilitate easier replication of the experiments, such as hyperparameter settings and training configurations.
The paper acknowledges that AdaSR is primarily focused on text streams with verifiable answers, which may limit its applicability to more complex scenarios involving continuous audio or video streams. Additionally, the reliance on reinforcement learning may introduce challenges in training stability and convergence, which are not thoroughly addressed.
The proposed framework has the potential to significantly enhance real-time reasoning capabilities in various applications, including interactive AI systems, real-time translation, and autonomous agents. By enabling models to adaptively allocate computation based on input dynamics, AdaSR could lead to more responsive and efficient AI systems in real-world scenarios.
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Sony AI, Georgia Tech, KAIST, Peking University, QMUL
TuneJury presents a novel approach to music generation preference alignment through a pairwise reward model, demonstrating competitive performance with a lean architecture and practical applications in real-world scenarios. The comprehensive evaluation and innovative calibration method position this work as a meaningful contribution to the field of machine learning in audio.
The methodology introduces TuneJury as a pairwise reward model for text-to-music generation, leveraging a small MLP head over frozen audio and text encoders. The choice of a pairwise approach is well-justified, addressing the limitations of absolute scoring systems in subjective domains like music. The model is trained on a diverse set of human-rated pairs, which enhances its robustness and generalizability. The introduction of anchor calibration as a post-hoc adjustment method is a notable innovation that allows for adaptation to new systems without the need for retraining, showcasing a practical approach to real-world application.
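As a rough illustration of how a pairwise reward model of this kind is typically used, the sketch below applies the standard Bradley-Terry link to two scalar scores and filters clips with a score threshold. The function names, scores, and threshold are hypothetical, and the sketch does not reproduce TuneJury's architecture or its anchor calibration procedure.

```python
import math

def preference_probability(score_a, score_b):
    """Bradley-Terry probability that clip A is preferred over clip B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(score_preferred, score_rejected))

def filter_by_threshold(scored_clips, threshold):
    """Keep the names of clips whose predicted score reaches the threshold."""
    return [name for name, score in scored_clips if score >= threshold]

# Hypothetical scores for clips generated from the same text prompt.
p = preference_probability(1.2, 0.4)
loss = pairwise_loss(1.2, 0.4)
kept = filter_by_threshold(
    [("clip_a", 1.2), ("clip_b", 0.4), ("clip_c", -0.3)], threshold=0.0
)
```

Only the score margin enters the probability, which is why a well-calibrated margin is enough to support threshold-based data filtering.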
The experimental evaluation is comprehensive, utilizing multiple datasets and benchmarks to assess the performance of TuneJury. The authors provide detailed comparisons against existing models, demonstrating that TuneJury achieves competitive accuracy with fewer parameters and without relying on pseudo-label augmentation. The results are statistically significant, with clear metrics reported for pairwise accuracy and calibration, as well as downstream applications. The experiments effectively illustrate the model's capabilities across different scenarios, including inference-time selection and latent optimization.
The paper includes sufficient details on the training procedure, architecture, and datasets used, which enhances reproducibility. The authors have made the code, checkpoints, and demo available, which is a strong point for enabling other researchers to replicate their findings. However, some hyperparameter settings and specific implementation details could be more explicitly stated to further aid reproducibility.
The paper acknowledges several limitations, including potential biases in the training data, particularly the lack of representation for vocal music and the calibration signal's dependence on the specific datasets used. The performance drop on post-cutoff splits indicates that the model may not generalize well to newer music generation systems, which could limit its applicability in rapidly evolving contexts.
TuneJury has the potential to significantly impact the field of music generation by providing a more aligned and efficient method for evaluating generated music against human preferences. Its open-source nature encourages community engagement and further research, potentially leading to advancements in multimodal systems that combine text and audio understanding. The implications for music generation tools and applications in creative industries are substantial, as this model could enhance user experience and satisfaction in automated music creation.
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Meanwhile, Mamba, a recent state-space model, has emerged as a promising alternative to Transformers for sequence modeling. In this work, we present MambAdapter, a parameter-efficient transfer learning approach that integrates Mamba into low-rank bottleneck adapters. Our design combines parameter sharing across adapters with the injection of a lightweight Mamba module, enabling more effective modeling of audio features. We demonstrate that MambAdapter matches or outperforms strong PETL baselines on four audio classification tasks and five speech recognition languages, even when operating under reduced parameter budgets.
Primary: Université de Montréal
All Institutions: Université de Montréal, Imperial College London, Concordia University, Mila -- Quebec AI Institute
The main contribution of this work is the introduction of MambAdapter, a novel parameter-efficient transfer learning method that combines Mamba's state-space modeling with low-rank bottleneck adapters, achieving competitive performance on audio and speech tasks while significantly reducing the number of trainable parameters. This paper represents a meaningful advancement in the quest for efficient model adaptation in the rapidly evolving field of audio processing.
The paper introduces MambAdapter, which innovatively integrates Mamba, a state-space model, into low-rank bottleneck adapters for parameter-efficient transfer learning in speech and audio tasks. The methodology is well-grounded in existing literature, leveraging the strengths of Mamba's linear-time modeling capabilities while addressing the inefficiencies of traditional Transformer fine-tuning. The use of shared projections and the lightweight Mamba module is a thoughtful design choice that enhances the model's ability to capture long-range dependencies in audio data.
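To see why sharing projections across adapters reduces the trainable-parameter budget, the following back-of-the-envelope sketch counts the parameters of a standard low-rank bottleneck adapter (down-projection plus up-projection). The dimensions are hypothetical, the assumption that both projections are shared across all layers is ours, and the size of the Mamba module is left as a free `per_layer_extra` argument because the source does not state it.

```python
def bottleneck_params(d_model, rank):
    """Weights and biases of one down-projection plus one up-projection."""
    down = d_model * rank + rank
    up = rank * d_model + d_model
    return down + up

def independent_adapters(d_model, rank, num_layers):
    """Every layer owns its own pair of projections."""
    return num_layers * bottleneck_params(d_model, rank)

def shared_adapters(d_model, rank, num_layers, per_layer_extra=0):
    """One pair of projections shared by all layers, plus per-layer extras."""
    return bottleneck_params(d_model, rank) + num_layers * per_layer_extra

# Hypothetical configuration: 12 layers, hidden size 768, bottleneck rank 16.
independent = independent_adapters(768, 16, 12)
shared = shared_adapters(768, 16, 12)
```

Under these assumptions the shared variant leaves room for an additional lightweight module per layer while still staying below the independent-adapter budget.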
The experimental setup is robust, with comprehensive evaluations across multiple audio classification tasks and multilingual speech recognition. The authors provide a clear comparison against established PETL baselines, demonstrating that MambAdapter achieves competitive or superior performance while maintaining a lower parameter budget. The results are statistically validated through averaging over multiple random seeds, which adds credibility to their findings.
The paper includes a link to the code repository, which is essential for reproducibility. However, the paper could benefit from more detailed hyperparameter settings and training configurations to facilitate easier replication of results by other researchers.
While the paper presents promising results, it does not extensively explore the limitations of MambAdapter, such as potential performance degradation in extremely low-resource settings or the impact of varying audio characteristics on model performance. Additionally, the focus on a limited number of datasets may restrict the generalizability of the findings.
The integration of Mamba into PETL frameworks has significant implications for the field of audio and speech processing, particularly in resource-constrained environments. The findings could influence future research directions in efficient model adaptation, potentially leading to advancements in real-time speech recognition and audio classification applications.
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract target-speaker representations, keeping the decoder frozen. We instantiate this as Dixtral, integrating a Diarization Conditioned Whisper (DiCoW) encoder into the Voxtral SLM. On AMI, NOTSOFAR-1, LibriSpeechMix, and Mixer6, Dixtral outperforms Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2 on speaker-attributed transcription by 29.0%, 19.8%, and 16.0% absolute cpWER respectively. On a novel long-form multi-speaker QA benchmark, zero-shot Dixtral matches Gemini on far-field content understanding, and when fine-tuned surpasses both Gemini and Voxtral operating on close-talk across all tasks.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Carnegie Mellon University
The paper presents a compelling and effective method for grounding spoken LLMs in multi-speaker audio through encoder-side diarization conditioning, achieving state-of-the-art performance on transcription and novel capabilities in multi-speaker reasoning and QA.
The paper proposes a novel architectural strategy for extending Spoken Large Language Models (SLMs) to multi-speaker scenarios by conditioning the acoustic encoder on diarization masks, rather than adapting the decoder. This approach, instantiated as Dixtral, integrates a Diarization Conditioned Whisper (DiCoW) encoder with a frozen Voxtral decoder. The core innovation lies in the "Diarization Conditioning" mechanism, which uses frame-level speaker activity probabilities (STNO masks) to modulate internal representations via learnable affine transformations (FDDT). This allows the model to extract target-speaker representations while keeping the LLM decoder frozen, thereby avoiding catastrophic forgetting of reasoning capabilities associated with Serialized Output Training (SOT) and vocabulary expansion. The methodology is theoretically sound, offering a computationally efficient alternative ($O(S N^2)$ vs $O((SN)^2)$) for multi-speaker decoding.
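The conditioning mechanism and the complexity comparison can be sketched in a few lines. The sketch assumes that STNO stands for the four frame-level classes silence, target, non-target, and overlap, and that each class has its own affine map whose outputs are mixed by the class probabilities; the function names, the two-dimensional frame, and the initialization are hypothetical and not taken from the paper.

```python
def affine(weight, bias, x):
    """Apply y = W x + b to one frame, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def diarization_conditioned_frame(x, stno_probs, transforms):
    """Mix four class-specific affine maps using the frame's STNO probabilities.

    stno_probs: probabilities for (silence, target, non-target, overlap).
    transforms: one (weight, bias) pair per STNO class, in the same order.
    """
    out = [0.0] * len(x)
    for p, (weight, bias) in zip(stno_probs, transforms):
        y = affine(weight, bias, x)
        out = [o + p * v for o, v in zip(out, y)]
    return out

def attention_cost(num_speakers, tokens_per_speaker):
    """Quadratic attention cost: per-speaker passes vs. one joint sequence."""
    per_speaker = num_speakers * tokens_per_speaker ** 2
    joint = (num_speakers * tokens_per_speaker) ** 2
    return per_speaker, joint

identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
# Hypothetical setup: identity for the target class, zero maps for the others.
transforms = [(zero, [0.0, 0.0]), (identity, [0.0, 0.0]),
              (zero, [0.0, 0.0]), (zero, [0.0, 0.0])]
frame = diarization_conditioned_frame([0.5, -1.0], [0.0, 1.0, 0.0, 0.0], transforms)
per_speaker, joint = attention_cost(num_speakers=4, tokens_per_speaker=100)
```

With S speakers of N tokens each, the joint cost exceeds the per-speaker cost by a factor of S, which is the gap between $O((SN)^2)$ and $O(S N^2)$.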
The evaluation is comprehensive, covering four standard multi-speaker ASR datasets (AMI, NOTSOFAR-1, LibriSpeechMix, Mixer6) and a novel long-form QA/Summarization benchmark (NSF-QA). Dixtral demonstrates significant improvements over strong baselines, including Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2, with absolute cpWER reductions of 16-29%. The inclusion of a paralinguistic QA task (emotion/gender) is particularly strong, as it tests the model's ability to utilize audio features beyond text, which cascaded systems cannot do. The results are robust, showing that zero-shot Dixtral matches Gemini on content QA and surpasses it when fine-tuned. The out-of-domain performance on Mixer6 further validates generalization.
The authors provide open-source code and a new dataset (NSF-QA). Training details are well-specified, including hardware constraints (8x A5000), optimization settings, and data chunking strategies. The use of established backbones (Whisper, Ministral, DiariZen) and clear integration points (FDDT, MLP adapter) ensures high reproducibility. The release of the benchmark dataset is a significant contribution to reproducibility in this niche.
The performance is inherently dependent on the quality of the external diarization system (DiariZen). Errors in diarization will propagate directly to the transcription and reasoning tasks. The paper acknowledges this but does not extensively analyze the sensitivity to diarization errors. Additionally, the current implementation requires separate inference passes for each target speaker, which, while more efficient than joint decoding, still scales linearly with the number of speakers. The fine-tuning for QA/Summarization slightly degrades pure ASR performance, indicating a trade-off that requires careful multi-task optimization in future work.
This work significantly advances the field of spoken language understanding by enabling end-to-end, multi-speaker reasoning in far-field audio. It bridges the gap between modular ASR pipelines and unified SLMs, offering a path towards more robust and capable voice assistants and meeting transcription tools. The ability to handle paralinguistic information (emotion, gender) in a multi-speaker context opens new avenues for affective computing and human-computer interaction.
Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech--mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. ArtBoost extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments show consistent improvements in PCC and RMSE. Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that ArtBoost can be integrated into diverse AAI models. These results suggest that speech--mesh data provide an effective and scalable source of articulatory supervision for AAI. Project page: https://cau-irislab.github.io/Interspeech26-ArtBoost/
Primary: Chung-Ang University
All Institutions: Chung-Ang University
The main contribution of this paper is the introduction of ArtBoost, a novel data augmentation strategy that effectively utilizes speech--mesh datasets to enhance acoustic-to-articulatory inversion under limited EMA supervision. This innovative approach addresses a critical gap in the field, demonstrating both methodological rigor and potential for broad application in speech technology.
The proposed methodology, ArtBoost, innovatively repurposes large-scale speech--mesh datasets to generate pseudo articulatory trajectories for acoustic-to-articulatory inversion (AAI). The three-step process—segmenting recordings, tracking facial anchors, and pre-training AAI models—demonstrates a systematic approach to augmenting limited EMA data. The use of visible facial dynamics to infer articulatory movements is a novel angle that effectively addresses the data scarcity issue in AAI, showcasing a solid understanding of both the limitations of current methodologies and the potential of leveraging existing datasets.
The experiments are well-structured, utilizing multiple datasets (HPRC and USC-TIMIT) and architectures to validate the effectiveness of ArtBoost. The reported improvements in PCC and RMSE across different models provide strong evidence of the method's robustness. However, the paper could benefit from more detailed statistical analyses and comparisons with baseline methods to further substantiate the claims of performance enhancement.
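For readers unfamiliar with the two metrics, the sketch below computes the Pearson correlation coefficient (PCC) and root-mean-square error (RMSE) between a predicted and a reference trajectory of a single articulator channel. The trajectories are made-up numbers, and how the paper aggregates the metrics across channels and utterances is not specified here.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between two equal-length trajectories."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def pcc(pred, target):
    """Pearson correlation coefficient between two trajectories."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

# Hypothetical one-channel trajectories: the prediction has a constant offset.
reference = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.5, 2.5, 3.5]
error = rmse(predicted, reference)
correlation = pcc(predicted, reference)
```

The example shows why both metrics are reported together: a constant offset leaves the correlation perfect while still producing a nonzero RMSE.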
The paper provides sufficient implementation details, including the preprocessing protocols and evaluation metrics used. However, the absence of code availability or a clear reproducibility statement limits the ease with which other researchers can replicate the results. Including a GitHub repository or similar would enhance reproducibility.
The main limitation lies in the reliance on visible articulators, which may not capture the full range of articulatory dynamics. Additionally, the performance gains are more pronounced in datasets with limited EMA data, suggesting that the method may not be as effective when ample ground-truth data is available. The paper also does not address potential biases in the speech--mesh datasets that could affect generalization.
ArtBoost has significant implications for speech synthesis, articulatory analysis, and related fields by providing a scalable method for training AAI models without extensive EMA data collection. This could lead to advancements in applications such as speech-driven animation and assistive technologies for speech disorders, making it a valuable contribution to the field.
Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Echoing joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that explores a structured feature prediction task based on articulatory features to enhance acoustic robustness. Specifically, ArtNet integrates an articulatory predictor, designed to extract universal articulatory representations from self-supervised learning (SSL) features, with a variational information bottleneck (VIB) to suppress language-specific variations. Experiments on seven unseen languages demonstrate that ArtNet, particularly when synergized with the proposed vector-space inventory alignment (VSIA) strategy, significantly outperforms competitive baselines, achieving a 20.56% relative reduction in phoneme error rate (PER) and 7.01% in phoneme feature error rate (PFER).
Primary: Fudan University
All Institutions: Fudan University, Pedawise
The main contribution of this paper is the introduction of ArtNet, a novel framework that employs articulatory features and a variational information bottleneck to improve zero-shot phoneme recognition across languages. This work represents a meaningful advancement in the field of automatic speech recognition, particularly in addressing the challenges posed by language-specific variations.
The proposed methodology of ArtNet is innovative, leveraging a structured prediction task based on articulatory features to enhance the robustness of phoneme recognition across languages. The integration of a variational information bottleneck (VIB) to suppress language-specific variations is a significant advancement. The framework's reliance on self-supervised learning (SSL) features and the construction of a structured articulatory target space are well-conceived, allowing for effective disentanglement of linguistic content from language-specific acoustic characteristics. The introduction of vector-space inventory alignment (VSIA) as an inference strategy further enhances the model's adaptability in zero-shot scenarios.
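A variational information bottleneck is commonly implemented as a KL penalty that pulls a Gaussian latent toward a standard normal prior. The sketch below shows that common form, assuming a diagonal Gaussian posterior parameterized by a mean and a log-variance; the names `mu`, `log_var`, and `beta`, and the numbers, are hypothetical, and ArtNet's exact objective may differ.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )

def vib_objective(task_loss, mu, log_var, beta):
    """Task loss plus a beta-weighted bottleneck penalty."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)

# Hypothetical latent statistics for a single frame.
penalty = kl_to_standard_normal(mu=[1.0, 0.0], log_var=[0.0, 0.0])
total = vib_objective(task_loss=2.0, mu=[1.0, 0.0], log_var=[0.0, 0.0], beta=0.1)
```

A larger `beta` compresses the latent more aggressively, which is the lever such a bottleneck offers for discarding language-specific detail at the possible cost of task accuracy.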
The experiments conducted on seven unseen languages provide a robust evaluation of ArtNet's performance. The reported improvements in phoneme error rate (PER) and phoneme feature error rate (PFER) demonstrate the effectiveness of the proposed framework. The use of multiple architectural variants (MLP, TDNN, LSTM) to assess the impact of temporal context on feature extraction is a commendable approach that adds depth to the experimental analysis. However, the paper could benefit from more detailed statistical analysis of the results to reinforce the significance of the findings.
The paper provides sufficient implementation details, including the architecture of the SSL backbone and the training process. However, it lacks a publicly accessible code repository or demo URL, which would enhance reproducibility. Clearer documentation of hyperparameters and training procedures would also assist other researchers in replicating the study.
One limitation is the reliance on a single source language (English) for training, which may not generalize well to all languages, especially those with significantly different phonetic structures. Additionally, while the paper addresses substitution errors, it does not fully explore other potential error types that could arise in zero-shot scenarios. The absence of a demo or project URL limits the accessibility of the framework for further exploration by the community.
The implications of this research are significant for multilingual speech recognition systems, particularly in resource-scarce languages. By enhancing zero-shot phoneme recognition, ArtNet could facilitate better communication technologies in diverse linguistic contexts, potentially improving accessibility and inclusivity in speech technology applications.
The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, and Global-structure features, and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring the latest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.
Primary: Central Conservatory of Music
All Institutions: Central Conservatory of Music, Southern University of Science and Technology, Fudan University
The paper presents Sofia, a flexible framework for synthetic song detection that leverages music-intrinsic features through a Mixture-of-Experts approach. This innovative methodology addresses existing limitations in SSD methods, providing a significant contribution to the field of audio analysis and detection.
The paper introduces Sofia, a novel framework for Synthetic Song Detection (SSD) that utilizes a Mixture-of-Experts (MoE) approach to model music-intrinsic features. This method allows for flexible feature incorporation, enabling the framework to adapt to various music generators and improve generalization. The use of feature-specific experts to capture distinct musical attributes is a significant methodological advancement, as it addresses the limitations of existing SSD methods that rely on low-level artifacts or fixed feature assumptions. The framework's design supports systematic analysis of individual and complementary contributions of different music features, which is a notable strength.
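The idea of fusing feature-specific experts through an adaptive gate can be illustrated with a plain softmax-weighted mixture. This is a generic Mixture-of-Experts sketch, not Sofia's actual module: the expert embeddings, gate logits, and function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(expert_outputs, gate_logits):
    """Fuse expert embeddings as a weighted sum with softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights

# Hypothetical embeddings from vocal, audio-effect, and global-structure experts.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = mixture_of_experts(expert_outputs, gate_logits=[2.0, 0.5, 0.5])
```

Because the gate weights are computed per input, a clip whose vocals carry the strongest cue can lean on the vocal expert while another clip relies more on global structure.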
The authors construct the MUSIC8K dataset, which serves as a benchmark for evaluating the performance of SSD methods against the latest music generators and realistic audio perturbations. The experiments demonstrate that Sofia achieves a substantial improvement in F1 score (18.5 points) over the strongest baseline, showcasing its effectiveness in learning generator-agnostic representations. The comprehensive evaluation across various configurations and the inclusion of robustness testing against audio perturbations further validate the framework's performance and adaptability.
The paper provides detailed implementation information, including training configurations, audio preprocessing, and encoder settings, which enhances reproducibility. The availability of the code and dataset on GitHub and Hugging Face facilitates further experimentation and validation by other researchers in the field.
While the framework shows promise, it primarily explores one instantiation (Sofia-VAG) based on selected music features. As music generation technology evolves, the reliance on a fixed set of features may limit the framework's applicability to future generators. Additionally, the paper does not address the potential computational overhead associated with the MoE architecture, which could impact real-time applications.
The development of a robust SSD framework has significant implications for the music industry, particularly in combating the proliferation of synthetic music and ensuring authenticity in audio content. The ability to detect synthetic songs effectively can enhance content moderation on platforms that host user-generated music, thereby preserving the integrity of artistic expression. Furthermore, the framework's adaptability to new music generation technologies positions it as a valuable tool for future research in audio analysis and detection.
This paper proposes a novel confidence-score-guided, incremental, and speaker-adaptive pseudo-labeling approach for semi-supervised elderly speech recognition. It facilitates higher-quality pseudo-label selection and progressive refinement, while also mitigating speaker heterogeneity. A confidence estimation module is designed to rank the reliability of untranscribed data, enabling a curriculum learning trajectory that progressively folds in unlabeled data subsets from high to low confidence. Speaker-specific characteristics are captured through speaker adaptive training with learnable prompts. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed method outperforms a semi-supervised baseline that uses neither confidence-score-guided incremental nor speaker-adaptive pseudo-labeling, with statistically significant word error rate (WER) and character error rate (CER) reductions of 1.45% and 2.27% absolute (6.21% and 6.98% relative), respectively.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Institute of Software, Chinese Academy of Sciences, National Research Council Canada
The main contribution of this paper is the introduction of a confidence score guided incremental and speaker adaptive pseudo-labeling approach for semi-supervised elderly speech recognition, which effectively addresses the challenges of speaker heterogeneity and pseudo-label reliability. This work represents a meaningful advancement in the field of audio processing and machine learning, particularly in enhancing the accessibility of speech recognition technologies for elderly users.
The proposed methodology introduces a confidence score guided incremental and speaker adaptive pseudo-labeling strategy, which is a significant advancement in semi-supervised learning for elderly speech recognition. The integration of a confidence estimation module allows for a more nuanced selection of pseudo-labels, addressing the limitations of previous methods that either discarded low-confidence labels or did not account for speaker variability. The curriculum learning approach enhances the training process by progressively incorporating data based on confidence levels, which is a novel aspect that could lead to improved model robustness and performance.
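The confidence-ordered curriculum described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `curriculum_stages`, the equal-sized stages, and the toy confidence scores are all hypothetical, and the confidence values are assumed to come from some separate estimator.

```python
def curriculum_stages(utterances, num_stages):
    """Build cumulative training pools ordered from high to low confidence.

    utterances: list of (utt_id, confidence) pairs (scores are assumed inputs).
    Returns a list of pools; pool k holds the ids of the (k + 1) most
    confident chunks, so later stages fold in less reliable pseudo-labels.
    """
    ranked = sorted(utterances, key=lambda u: u[1], reverse=True)
    chunk = -(-len(ranked) // num_stages)  # ceiling division
    pools = []
    for stage in range(num_stages):
        pools.append([utt_id for utt_id, _ in ranked[: (stage + 1) * chunk]])
    return pools


# Toy example with made-up confidence scores.
scored = [("u1", 0.95), ("u2", 0.40), ("u3", 0.80), ("u4", 0.65)]
pools = curriculum_stages(scored, num_stages=2)
print(pools)
```

In this sketch the first pool contains only the most confident half of the data, and the second pool adds the remainder, mirroring the high-to-low progression the review describes.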
The experiments conducted on the DementiaBank Pitt and JCCOCC MoCA datasets provide strong empirical support for the proposed method. The reported reductions in WER and CER are statistically significant, indicating that the method not only improves performance over baseline models but also effectively addresses the unique challenges posed by elderly speech. The use of diverse datasets enhances the generalizability of the findings, although further validation on additional datasets would strengthen the claims.
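The relationship between the absolute and relative reductions quoted in the abstract is simple arithmetic: relative reduction equals absolute reduction divided by the baseline error rate. The sketch below back-computes the baselines implied by the reported pairs. The helper name `implied_baseline` is hypothetical, and the resulting values are approximations derived from rounded figures, not numbers quoted from the paper.

```python
def implied_baseline(absolute_reduction, relative_reduction_pct):
    """Back out the baseline error rate (in %) from the two reported reductions."""
    return absolute_reduction / (relative_reduction_pct / 100.0)


# Reported pairs: 1.45% absolute / 6.21% relative, 2.27% absolute / 6.98% relative.
first = implied_baseline(1.45, 6.21)
second = implied_baseline(2.27, 6.98)
print(round(first, 2), round(second, 2))
```

Under these figures, the implied baseline error rates are roughly 23% and 33%, which helps put the absolute gains in context.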
The paper provides a clear description of the experimental setup, including the model architecture (Whisper), the parameters used for training, and the data preparation steps. However, the lack of publicly available code or a demo URL limits the reproducibility of the results. Sharing the implementation would facilitate further research and validation of the findings.
The paper does not address potential biases in the datasets used, which may affect the generalizability of the results. Additionally, while the method shows promise, it may require extensive computational resources for training, which could limit its accessibility for broader applications. The reliance on the Whisper model may also introduce limitations based on its inherent capabilities and biases.
The proposed approach has significant implications for improving speech recognition systems for elderly populations, which is crucial given the aging global demographic. Enhanced speech recognition capabilities can facilitate better communication and access to services for elderly individuals, thereby improving their quality of life. This work could also inspire further research into adaptive learning techniques for other marginalized speech groups.
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level BERT into a phoneme-level BERT. This design provides prior word- and sentence-level information, reducing the amount of pre-training required by the phoneme encoder. Subjective listening evaluations show that CraBERT achieves MOS values comparable to existing PPEncs after approximately one epoch of pre-training, whereas the baselines in our comparison are pre-trained for approximately ten epochs. These results demonstrate that CraBERT can efficiently learn representations suitable for improving the perceived naturalness and prosody of synthesized speech.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
This paper introduces CraBERT, an efficient phoneme encoder that significantly reduces pre-training time while maintaining high-quality speech synthesis. The innovative integration of subword representations and the development of a new alignment algorithm mark a notable advancement in the field of text-to-speech technologies.
The methodology presented in this paper is innovative in its use of a cascade-fusion architecture that integrates subword representations from a pre-trained BERT model into a phoneme-level BERT. This approach addresses the inefficiencies of traditional phoneme encoders by leveraging existing word- and sentence-level information, significantly reducing the pre-training time required for effective phoneme representation. The introduction of a data-driven subword-phoneme alignment algorithm based on dynamic time warping (DTW) further enhances the methodology, providing a systematic way to fuse these representations.
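For readers unfamiliar with dynamic time warping, a generic DTW alignment over a cost matrix might look like the sketch below. This is textbook DTW, not CraBERT's specific alignment algorithm: the function name `dtw_path`, the toy cost matrix, and the idea of using rows for subwords and columns for phonemes are illustrative assumptions.

```python
def dtw_path(cost):
    """Classic DTW over cost[i][j]; returns the monotonic alignment path."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best_prev
    # Backtrack from the final cell to the origin along cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


# Hypothetical costs: 2 subwords (rows) against 4 phonemes (columns).
cost = [[0, 0, 1, 1],
        [1, 1, 0, 0]]
path = dtw_path(cost)
print(path)
```

Each pair in the path links one subword index to one phoneme index, so a subword representation could then be fused into every phoneme position it is aligned with.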
The experimental evaluation is robust, employing subjective listening tests to assess the quality of synthesized speech using mean opinion scores (MOS). The results indicate that CraBERT achieves comparable performance to existing phoneme encoders after a fraction of the pre-training time, demonstrating its efficiency. The use of a multi-speaker dataset from the LibriTTS corpus adds credibility to the findings, although more diverse datasets could strengthen the generalizability of the results.
The paper provides detailed descriptions of the architecture, pre-training processes, and experimental setups, which are essential for reproducibility. However, the lack of publicly available code or a project repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their implementation to enhance reproducibility.
One limitation is the reliance on a single pre-trained model (DistilBERT) for subword representations, which may not generalize across all languages or phonetic systems. Additionally, while the subjective evaluations show promising results, they are limited to a specific dataset and may not reflect performance across different languages or dialects. The paper also does not explore the potential for further optimization of the alignment algorithm.
The implications of this research are significant for the field of text-to-speech synthesis, particularly in improving the efficiency of phoneme encoders. The advancements in pre-training methodologies could lead to more accessible and faster TTS systems, which can be beneficial in various applications, including virtual assistants, audiobooks, and language learning tools. The approach could also inspire further research into efficient representation learning in other domains.
This paper proposes a novel cross-utterance audio-textual prompts based speaker adaptation approach for elderly speech recognition. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utterances, before being fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model by statistically significant word error rate (WER) or character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative). Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Institute of Software, Chinese Academy of Sciences, National Research Council Canada
The main contribution of this paper is the introduction of a novel online speaker adaptation method that leverages cross-utterance audio-textual prompts for elderly speech recognition. This work significantly advances the field by addressing the unique challenges posed by elderly speech, offering a real-time solution that integrates both acoustic and linguistic contexts for improved recognition accuracy.
The proposed methodology introduces a novel approach to speaker adaptation for elderly speech recognition through the use of cross-utterance audio-textual prompts. This dual cross-modality fusion technique effectively integrates both audio and textual contexts, addressing key challenges in elderly speech recognition, such as speaker heterogeneity and language deficiencies. The use of a Q-Former for compressing variable-length history information is innovative, allowing for real-time adaptation without significant latency. The method's ability to perform zero-shot adaptation is particularly noteworthy, as it enables the system to adapt to unseen speakers dynamically.
The experiments conducted on two distinct elderly speech datasets, DementiaBank Pitt and JCCOCC MoCA, provide robust evidence of the method's effectiveness. The reported reductions in WER and CER are statistically significant, and the real-time factor speed-up of up to 9.83 times over offline methods is impressive. The comprehensive evaluation against various baselines, including i/x-vectors and ECAPA-TDNN, demonstrates the superiority of the proposed approach. However, the paper could benefit from a more detailed analysis of the datasets and the specific experimental setups used.
While the methodology is well-documented, there is limited information on the exact implementation details, such as hyperparameter settings and training procedures. This lack of detail may hinder reproducibility. Including a supplementary material section with code or detailed configurations would enhance the paper's reproducibility.
The paper does not address potential limitations regarding the generalizability of the proposed method across different languages or dialects beyond English and Cantonese. Additionally, the reliance on the Whisper model may limit the applicability of the approach to other ASR systems. The performance metrics, while statistically significant, may not fully capture the real-world usability of the system in varied acoustic environments.
The implications of this research are significant, particularly as the global population ages. Improved speech recognition for elderly individuals can enhance their communication abilities, thereby fostering social engagement and improving quality of life. The approach could be adapted for use in other domains where speaker adaptation is critical, such as healthcare and assistive technologies.
Audio deepfake detectors often fail to generalize across speakers because they learn speaker-identity features rather than synthesis artifacts, a problem known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework enforcing feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets demonstrate that the proposed method achieves 1.35%, 7.88%, and 21.58% equal error rates (EER), respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.
Primary: Shanghai Jiao Tong University
All Institutions: Beijing Jiaotong University, ITMO University, Shanghai Jiao Tong University
This paper presents a dual-granularity orthogonal disentanglement framework that effectively addresses the challenge of generalizing audio deepfake detection across different speakers. The innovative methodology, combined with rigorous experimental validation, positions this work as a meaningful contribution to the field of audio processing and machine learning.
The proposed dual-granularity orthogonal disentanglement framework is innovative in its approach to mitigating implicit identity leakage in audio deepfake detection. By enforcing feature independence at both sample and batch levels, the authors provide a robust method that avoids the complexities associated with adversarial training and auxiliary networks. The curriculum disentanglement schedule is a thoughtful addition that enhances the training process by gradually increasing the constraints, which is a novel approach not commonly seen in similar works. The methodology is well-structured, with clear definitions of the problem, the architecture, and the loss functions used.
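The two granularities of orthogonality can be illustrated with a small pure-Python sketch. The exact loss definitions in the paper are not given here, so the forms below (mean squared cosine similarity per sample, and the sum of squared cross-covariance entries per batch) are assumptions, as are the function names and the toy embeddings.

```python
import math


def cosine_orth_loss(a, b):
    """Sample level: mean squared cosine similarity between paired rows."""
    total = 0.0
    for x, y in zip(a, b):
        dot = sum(p * q for p, q in zip(x, y))
        norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
        total += (dot / (norm + 1e-8)) ** 2
    return total / len(a)


def cross_cov_loss(a, b):
    """Batch level: sum of squared entries of the cross-covariance matrix."""
    n = len(a)
    mean_a = [sum(col) / n for col in zip(*a)]
    mean_b = [sum(col) / n for col in zip(*b)]
    loss = 0.0
    for i in range(len(mean_a)):
        for j in range(len(mean_b)):
            cov = sum(
                (a[k][i] - mean_a[i]) * (b[k][j] - mean_b[j]) for k in range(n)
            ) / n
            loss += cov ** 2
    return loss


# Hypothetical artifact and speaker embeddings for a batch of two samples.
artifact = [[1.0, 0.0], [0.0, 1.0]]
speaker = [[0.0, 1.0], [1.0, 0.0]]
sample_loss = cosine_orth_loss(artifact, speaker)
batch_loss = cross_cov_loss(artifact, speaker)
print(sample_loss, batch_loss)
```

In this toy batch every pair is orthogonal at the sample level, yet the embedding dimensions remain linearly correlated across the batch, which suggests why the two terms may be complementary rather than redundant.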
The experiments are comprehensive, utilizing multiple datasets (ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild) to validate the proposed method's effectiveness. The reported equal error rates (EER) demonstrate significant improvements over existing methods, particularly in cross-dataset generalization, which is a critical aspect of audio deepfake detection. The ablation studies further strengthen the claims by quantifying the contributions of various components of the model.
The paper provides sufficient details regarding the architecture, training procedures, and hyperparameters, which would allow for reproducibility. However, the absence of a publicly available code repository or demo URL is a notable drawback, as it limits the community's ability to validate and build upon the work.
While the proposed method shows promising results, it may still be susceptible to certain types of audio manipulations not covered in the datasets used. Additionally, the reliance on cosine orthogonality and cross-covariance regularization may not capture all forms of identity leakage, suggesting that further exploration of alternative disentanglement techniques could be beneficial.
The implications of this work are significant, particularly in the context of increasing concerns over audio deepfakes in security and misinformation. The proposed framework could enhance the reliability of speaker verification systems and contribute to the development of more robust audio analysis tools. As deepfake technology continues to evolve, methods like this will be crucial in maintaining trust in audio communications.
Microphone bleed is a persistent challenge in small ensembles and orchestral recordings, where close microphones intended for individual instruments also capture leakage from nearby sources. This overlap degrades track isolation and complicates mixing. This paper addresses the bleeding problem by making channel-permutation-equivariance a core learning principle. During training, we apply the same random permutation to the input microphone channels and their corresponding reference targets. This discourages reliance on fixed channel-instrument associations and improves robustness to changes in the recording setup and even in the recorded instruments. The proposed model is trained on synthetic ensembles with diverse simulated room acoustics and microphone placements, and evaluated on unseen simulated conditions and real URMP recordings. The results show that permutation-aware training consistently improves SDR and reduces bleeding under unseen conditions compared with non-permutation baselines. The findings highlight permutation-equivariance as a simple, data-centric strategy for robust debleeding and practical multi-channel source separation in music production workflows.
Primary: Tampere University
All Institutions: University of Jaen, Tampere University
This paper presents a compelling approach to reducing microphone bleed in small music ensembles through channel-permutation equivariance, significantly advancing the field of multichannel source separation. The methodology is innovative, and the experimental results validate its effectiveness, marking a meaningful contribution to audio processing research.
The paper introduces a novel approach to address microphone bleed in small music ensembles by enforcing channel-permutation equivariance during training. This methodology is well-justified, as it mitigates overfitting to specific instrument timbres and channel assignments, thereby enhancing the model's ability to generalize across different recording conditions. The adaptation of the Hybrid Demucs architecture for this purpose is a significant technical contribution, allowing for effective multichannel source separation.
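The core of permutation-aware training, applying one shared random permutation to the input channels and their reference targets, can be sketched as below. The function name `permute_channels` and the string placeholders standing in for audio arrays are hypothetical; a real pipeline would permute tensors inside the data loader.

```python
import random


def permute_channels(mix_channels, target_channels, rng):
    """Apply the same random channel permutation to inputs and targets."""
    if len(mix_channels) != len(target_channels):
        raise ValueError("inputs and targets must have the same channel count")
    perm = list(range(len(mix_channels)))
    rng.shuffle(perm)
    permuted_mix = [mix_channels[p] for p in perm]
    permuted_targets = [target_channels[p] for p in perm]
    return permuted_mix, permuted_targets, perm


# Placeholders for three close-microphone signals and their clean references.
mics = ["mic_violin", "mic_cello", "mic_flute"]
refs = ["ref_violin", "ref_cello", "ref_flute"]
rng = random.Random(0)
new_mics, new_refs, perm = permute_channels(mics, refs, rng)
print(perm, new_mics, new_refs)
```

Because both lists are indexed with the same permutation, each microphone stays paired with its own reference while its channel position changes from one training example to the next.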
The experiments are robust, utilizing both synthetic datasets and real-world recordings from the URMP dataset. The evaluation metrics, particularly the signal-to-distortion ratio (SDR), are appropriate for the task and provide clear insights into the model's performance. The results demonstrate that permutation-aware training consistently outperforms non-permutation baselines across various conditions, reinforcing the effectiveness of the proposed method.
The paper provides sufficient detail regarding the experimental setup, including the synthetic data generation process and the training configurations. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider releasing the code to facilitate further exploration and validation by the community.
One limitation noted is the reliance on synthetic datasets, which may not fully capture the complexities of real-world recordings. Additionally, the study focuses on a fixed number of channels (P=5), which may not generalize to larger ensembles. The authors also acknowledge the need for larger real close-microphone datasets for future evaluations.
The findings have significant implications for music production workflows, particularly in classical and ensemble recordings where microphone bleed is a common issue. By improving source separation techniques, this research can enhance audio quality and facilitate better mixing processes, ultimately benefiting musicians, sound engineers, and the music industry at large.
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology
The main contribution of this paper is the introduction of MuVAP, a causal multimodal framework that effectively predicts turn-taking in multiparty conversations using a single audio stream and a single camera view. This innovative approach, combined with the creation of the AVCC dataset, addresses critical limitations in existing methods and has the potential to advance the field of conversational AI significantly.
The paper presents MuVAP, a novel multimodal framework that integrates audio and visual data to predict turn-taking in multiparty conversations. The methodology is well-structured, introducing Role-Relative Projection to simplify the complexity of multiparty interactions by focusing on the current and next speaker. The use of a single audio channel and a single camera view is a significant departure from traditional methods, which often require complex setups. The introduction of the Audio-Visual Conversation Corpus (AVCC) is a crucial contribution, as it provides a dataset specifically designed for this type of analysis, addressing the limitations of existing datasets.
The experiments are comprehensive, comparing MuVAP against strong baselines across various tasks, including Shift-Hold and Next Speaker Prediction. The results demonstrate that MuVAP outperforms these baselines, showcasing its effectiveness in real-world scenarios. The evaluation metrics used, such as Macro-F1 for Shift-Hold Prediction and accuracy for Next Speaker Prediction, are appropriate for the tasks at hand. The paper provides detailed results and analysis, indicating a thorough evaluation process.
The paper includes sufficient implementation details, including the architecture of the model, training procedures, and the datasets used. However, the lack of a public demo or clear access to the trained models may hinder full reproducibility. The GitHub repository provides some resources, but additional documentation would enhance reproducibility.
The paper acknowledges limitations, such as the class imbalance introduced by the Role-Relative Projection and the reliance on visual tracking that may miss subtle facial cues. Additionally, the model's performance is evaluated primarily on two- and three-speaker settings, which may not fully represent larger group dynamics. The authors also note the potential for improved performance with more advanced visual encoders.
The implications of this research are significant for human-robot interaction and conversational AI, as it enables more natural and responsive interactions in multiparty settings. The ability to predict turn-taking dynamics using standard hardware makes this approach accessible for various applications, including social robotics, virtual assistants, and interactive media.
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous work and evaluate two candidate explanations, phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by a suboptimal training configuration: a fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
Primary: Carnegie Mellon University Africa
All Institutions: Carnegie Mellon University Africa
The main contribution of this paper is the identification of training configuration as the primary cause of quality degradation in neural audio codecs at low frame rates, challenging previous assumptions about the inherent limitations of such systems. This work offers a novel perspective on codec design, emphasizing the importance of training methodologies in achieving efficient and intelligible audio synthesis.
The authors employ a controlled ablation study to investigate the effects of low frame rates on neural audio codecs, specifically focusing on the training configurations that lead to performance degradation. They systematically analyze potential causes for the observed quality cliff, such as phonemic collisions and codebook saturation, and identify a training misconfiguration as the primary issue. This approach is methodologically sound, as it combines theoretical analysis with empirical validation, allowing for clear conclusions about the limitations of current training practices.
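The training misconfiguration comes down to simple arithmetic: the number of tokens in a clip is its duration multiplied by the frame rate. The sketch below illustrates this with a hypothetical 4-second training clip; the actual clip duration used in the paper is not stated here, and both helper names are assumptions.

```python
def tokens_per_clip(clip_seconds, frame_rate_hz):
    """Number of whole codec tokens produced for one training clip."""
    return int(clip_seconds * frame_rate_hz)


def clip_seconds_for(tokens, frame_rate_hz):
    """Clip duration needed to keep a given token count at a frame rate."""
    return tokens / frame_rate_hz


# Hypothetical fixed clip length; frame rates are those named in the abstract.
clip_seconds = 4.0
for rate in (12.5, 6.25, 3.1, 1.6):
    print(rate, tokens_per_clip(clip_seconds, rate))

# Keeping the 12.5 Hz token count at 6.25 Hz would require a longer clip.
print(clip_seconds_for(tokens_per_clip(clip_seconds, 12.5), 6.25))
```

With the clip length held fixed, halving the frame rate halves the decoder's token context, so scaling clip duration inversely with frame rate is one plausible way to keep that context constant.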
The experiments are well-structured, utilizing a range of frame rates and comparing the performance of various codecs on established metrics such as WER, STOI, and SPK-SIM. The use of a comprehensive dataset (LibriSpeech) and the evaluation of multiple metrics provide a robust assessment of codec performance across different configurations. The results clearly illustrate the impact of training configuration on codec intelligibility at low frame rates, thus contributing valuable insights into the design of future codecs.
The paper provides sufficient details regarding the training process, model architectures, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the lack of publicly available code or models limits the ease of reproducibility. Including a project URL with the code would enhance this aspect significantly.
One limitation is the focus on a specific dataset (LibriSpeech), which may not generalize to all audio synthesis tasks or languages. Additionally, while the authors identify training configuration as a key factor, they do not explore other potential architectural modifications that could further improve performance at low frame rates. The paper also lacks a discussion on the computational costs associated with training and inference at these low frame rates.
The findings have significant implications for the design of neural audio codecs, particularly in applications where inference efficiency is critical, such as real-time speech synthesis and low-latency communication systems. By demonstrating that low frame rates can be utilized effectively with appropriate training strategies, this work paves the way for more efficient audio processing technologies in various domains.
This work investigates modelling strategies in continuous and discrete latent spaces for vector quantisation (VQ)-based neural audio codec (NAC) speech enhancement (SE), along with the role of VQ regularisation. We propose cNAC-SE and dNAC-SE frameworks that predict continuous representations and discrete tokens in latent space, respectively. Theoretical analysis and visualisations in latent space are performed to exhibit their inherent modelling mechanisms. Experimental results show that the fully fine-tuned cNAC-SE model consistently outperforms all dNAC-SE variants across diverse test conditions and achieves leading performance among established generative approaches in DNS-MOS metrics. Comparison with the discriminative counterpart shows that VQ enhances robustness through an intrinsic effect of clean-prior-constrained regularisation, independent of discrete token processing. This highlights the transferable value of VQ regularisation to other continuous modelling methods.
Primary: Ghent University - imec
All Institutions: Ghent University - imec
The paper presents two novel VQ-based frameworks for speech enhancement, demonstrating their effectiveness and robustness through comprehensive experiments. The technical contributions, particularly the introduction of clean-prior-constrained VQ regularization, provide valuable insights into generative modeling strategies and have the potential to influence future research in the field.
The paper proposes two innovative frameworks, cNAC-SE and dNAC-SE, that leverage vector quantization (VQ) in both continuous and discrete latent spaces for speech enhancement. The methodology is well-structured, with a clear distinction between the two models and a robust theoretical analysis of their latent space mechanisms. The use of VQ regularization is particularly noteworthy, as it enhances the robustness of the generative model, which is a significant contribution to the field. The architecture employs transformer blocks effectively, and the introduction of clean-prior-constrained VQ regularization is a novel approach that differentiates this work from prior models.
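The difference between a continuous latent and its discrete token can be illustrated with a basic vector quantisation lookup. This is a generic nearest-neighbour sketch, not the codec used in the paper: the function name `quantize`, the toy two-dimensional codebook, and the squared Euclidean distance are all assumptions.

```python
def quantize(latent, codebook):
    """Return (token index, code vector) of the nearest codebook entry."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    index = min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))
    return index, codebook[index]


# Hypothetical codebook and a continuous latent predicted by an enhancer.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
latent = [0.9, 1.2]
token, code_vector = quantize(latent, codebook)
print(token, code_vector)
```

Roughly speaking, a model in the discrete style would predict the token index directly, whereas a model in the continuous style would predict the latent vector, with quantisation then snapping it onto a codebook entry learned from clean speech.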
The experiments are comprehensive, utilizing the DNS3 Challenge dataset, which is a relevant and challenging benchmark for speech enhancement tasks. The paper presents a thorough evaluation of various configurations of the proposed models, comparing them against established generative methods. The results demonstrate that the cNAC-SE model outperforms its dNAC-SE counterpart and other generative models in terms of DNS-MOS metrics, indicating the effectiveness of the proposed methodologies. The ablation studies further validate the importance of fine-tuning the encoder and decoder, showcasing the robustness of the models under different conditions.
The paper provides sufficient details regarding the experimental setup, including the dataset, training parameters, and architecture configurations, which facilitates reproducibility. However, the lack of a publicly available code repository limits the ease of reproduction for other researchers. The authors could enhance reproducibility by sharing their implementation or providing a link to a code repository.
One limitation of the study is the computational overhead associated with the full codec pipeline, which may hinder deployment in resource-constrained environments. Additionally, while the models show strong performance across various test conditions, the paper does not extensively discuss their performance in extreme noise conditions or with highly distorted inputs, which could be a relevant area for future exploration.
The proposed methodologies have significant implications for real-world applications in speech enhancement, particularly in scenarios involving noisy environments, such as telecommunications and assistive technologies. The findings could lead to improved user experiences in voice communication systems and enhance accessibility for individuals with hearing impairments. Furthermore, the transferable value of VQ regularization to other continuous modeling methods opens avenues for further research in generative modeling across different domains. The paper presents two novel VQ-based frameworks for speech enhancement, demonstrating their effectiveness and robustness through comprehensive experiments. The technical contributions, particularly the introduction of clean-prior-constrained VQ regularization, provide valuable insights into generative modeling strategies and have the potential to influence future research in the field.
With the growing focus on audio in multimedia applications, numerous advanced works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) and other related audio generation tasks, such as instruction-based audio editing, as independent challenges, adopting task-specific architectures or modules. This absence of a unified modeling paradigm substantially increases the overhead and complexity of building a system for both audio generation and editing, while also limiting scalability. To address this issue, we introduce AudioWeave, a unified model for TTA and audio editing without additional task-specific components. Specifically, we propose a joint condition modeling approach with a factorized position embedding, enabling the diffusion transformer backbone to operate under the heterogeneous inputs of TTA and audio editing. We further propose a progressive multistage training strategy to mitigate task competition and catastrophic forgetting caused by interference among multiple tasks. This in turn helps maintain the performance of each individual task and may even lead to improvements in certain aspects. Experimental results on the TTA task and six audio editing tasks show that our unified model achieves performance competitive with task-specific models, laying the groundwork for further exploration of unified audio generation models.
Primary: Institute of Artificial Intelligence of China Telecom (TeleAI)
All Institutions: Institute of Artificial Intelligence of China Telecom (TeleAI), Department of Electronic Engineering and Information Science, School of Artificial Intelligence, Tianjin University, Tianjin Key Laboratory of Cognitive Computing and Application
This paper presents AudioWeave, a unified model for audio generation and editing that effectively combines multiple tasks into a single framework, demonstrating competitive performance and paving the way for future research in unified audio models.
The paper introduces a novel unified model, AudioWeave, which integrates text-to-audio generation and audio editing tasks using a single architecture. The methodology includes a joint condition modeling approach with factorized position embedding and a progressive multistage training strategy that mitigates task competition. This approach is innovative as it avoids the complexity of task-specific architectures and allows for effective interaction between different audio generation tasks.
The experiments are comprehensive, utilizing multiple datasets for both TTA and audio editing tasks. The results demonstrate competitive performance against state-of-the-art models, with both objective and subjective evaluations showing the effectiveness of the proposed model. The inclusion of human evaluations (MOS) alongside objective metrics strengthens the validity of the findings.
The paper provides detailed implementation details, including model architecture, training strategies, and datasets used, which facilitates reproducibility. However, the lack of a publicly available code repository limits full reproducibility.
One limitation is the reliance on existing datasets, which may not fully represent the diversity of audio generation tasks. Additionally, while the model performs well, it may still lag behind specialized models in certain edge cases or specific tasks.
The unified approach to audio generation and editing has significant implications for multimedia applications, potentially streamlining workflows in content creation. The model's ability to handle multiple tasks with a single architecture could lead to more efficient tools for audio professionals and enhance user engagement in interactive media.
Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.
Primary: National University of Singapore
All Institutions: National University of Singapore
The main contribution of this work is the introduction of Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references, effectively addressing the limitations of existing accent TTS systems. This innovative approach, combined with the introduction of WhisAID for accent identification, represents a significant advancement in the field of text-to-speech synthesis, particularly for accented speech.
The methodology presented in this paper is innovative, leveraging a diffusion-based approach to accent TTS that bypasses the traditional two-stage pipeline. The introduction of WhisAID for accent identification and the use of conditional layer normalization (CLN) to integrate accent and speaker representations are noteworthy advancements. The paper effectively addresses the limitations of existing methods by focusing on acoustic characteristics rather than solely relying on text-based representations. The use of gradient reversal layers to disentangle speaker identity from accent characteristics is a clever approach that enhances the model's generalization capabilities.
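As an illustration of conditional layer normalization, the sketch below normalizes a feature vector and then applies a scale and bias that are linear projections of a condition vector (for example, a concatenated accent and speaker embedding). The function name, the projection matrices `w_scale` and `w_bias`, and the toy values are assumptions; Joycent's actual layer sizes and conditioning inputs are not given here.

```python
import math


def conditional_layer_norm(x, cond, w_scale, w_bias, eps=1e-5):
    """Layer-normalize `x`, then modulate it with condition-dependent scale/bias.

    x       : list of floats (one hidden vector)
    cond    : list of floats (condition embedding, hypothetical)
    w_scale : len(x) rows, each of len(cond); projects cond to a per-dim scale
    w_bias  : len(x) rows, each of len(cond); projects cond to a per-dim bias
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]

    # Unlike plain layer norm, scale and bias are not fixed parameters:
    # they are computed from the condition vector.
    scale = [sum(w * c for w, c in zip(row, cond)) for row in w_scale]
    bias = [sum(w * c for w, c in zip(row, cond)) for row in w_bias]
    return [s * n + b for s, n, b in zip(scale, normed, bias)]


# Toy usage: a condition that yields scale 1 and bias 0 reduces to plain layer norm.
out = conditional_layer_norm(
    x=[1.0, 2.0, 3.0],
    cond=[1.0, 0.0],
    w_scale=[[1.0, 0.0]] * 3,
    w_bias=[[0.0, 0.0]] * 3,
)
```

The design point is that the same encoder weights can produce differently modulated hidden states for different accents or speakers, since only the projected scale and bias change with the condition.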
The experiments are well-structured, utilizing multiple datasets and evaluation metrics, including both subjective (MOS, SMOS) and objective measures (accuracy, F1 score). The results demonstrate that Joycent outperforms baseline models in terms of accentedness while maintaining speaker identity, which is a critical aspect of TTS systems. The ablation studies further validate the effectiveness of the proposed methods, providing insights into the importance of embedding placement and conditioning strategies.
The paper provides sufficient details regarding the model architecture, training procedures, and datasets used, which enhances reproducibility. The authors have also made their code available on GitHub, facilitating further experimentation and validation by other researchers in the field.
One limitation is the focus on Mandarin accents, which may restrict the generalizability of the findings to other languages or accents. Additionally, while the model shows promise, the subjective evaluation scores indicate that there is still room for improvement in naturalness compared to some baseline systems. The paper could also benefit from a more extensive discussion on the computational efficiency of the proposed method in real-world applications.
The proposed model has significant implications for applications in language learning, speech synthesis, and accessibility technologies. By improving the synthesis of accented speech, it can enhance user experiences in various applications, including virtual assistants, language learning tools, and media content localization. Furthermore, the ability to generate diverse accented pronunciations can aid in the development of more robust mispronunciation detection systems.
Short-duration speaker verification (SDSV) is crucial for personalized keyword spotting, where test utterances are typically shorter than three seconds. Limited speech duration results in unstable speaker representations and increased sensitivity to noise and phoneme variations, thereby degrading performance. To investigate this issue, we construct VoxPhrase, a large-scale SDSV corpus automatically segmented from the VoxCeleb dataset. Our analysis shows that text-dependent (TD) enrollment is constrained by duration and yields unstable speaker representations. In contrast, although text-independent (TI) enrollment introduces content mismatch, its representations become more stable as the enrollment duration increases. Accordingly, we propose a hybrid-enrollment neural re-scoring framework that combines TD and TI enrollment and performs frame-level comparison via parallel cross-attention. Experiments on VoxPhrase demonstrate consistent improvements across multiple speaker models.
Primary: Xi’an Jiaotong-Liverpool University
All Institutions: Xi’an Jiaotong-Liverpool University, Hithink RoyalFlush AI Research Institute, Shanghai University
This paper presents a novel hybrid approach to short-duration speaker verification that combines text-dependent and text-independent methods, demonstrating significant improvements in performance through rigorous experimentation. The methodology and results contribute valuable insights to the field, particularly in addressing the challenges of speaker verification in practical applications.
The paper introduces a hybrid enrollment method that effectively combines text-dependent (TD) and text-independent (TI) speaker verification techniques, addressing the challenges posed by short-duration utterances. The methodology is well-structured, utilizing a frozen speaker model for feature extraction and a neural verifier that employs parallel cross-attention for frame-level similarity modeling. This approach is innovative as it leverages the strengths of both TD and TI methods while mitigating their individual weaknesses, particularly in the context of short-duration speaker verification.
The experiments are comprehensive, utilizing a newly constructed VoxPhrase dataset that allows for systematic evaluation of the proposed method. The results demonstrate consistent performance improvements across various speaker models and evaluation conditions, particularly in hard-case scenarios. The use of Equal Error Rate (EER) as a performance metric is appropriate and provides a clear measure of the system's effectiveness.
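For readers unfamiliar with Equal Error Rate, the following minimal sketch shows one common way to estimate it from verification scores: sweep a decision threshold and report the point where the false acceptance rate and false rejection rate are closest. The function name and the toy scores are assumptions, and this threshold sweep is a simple approximation rather than the evaluation code used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold.
    FRR = fraction of target (same-speaker) trials rejected.
    FAR = fraction of non-target (different-speaker) trials accepted.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy usage with hypothetical similarity scores.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

A lower EER means the score distributions of same-speaker and different-speaker trials overlap less, which is why it serves as a single-number summary for verification systems.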
The paper provides sufficient details regarding the experimental setup, including the models used, training configurations, and evaluation metrics. However, the lack of publicly available code or dataset access limits reproducibility. Future work should consider releasing the VoxPhrase dataset and the trained models to enhance reproducibility.
One limitation is the reliance on a frozen speaker model, which may restrict the adaptability of the system to new speaker data. Additionally, while the hybrid approach shows improvements, the paper does not extensively explore the trade-offs between TD and TI enrollment under varying conditions, which could provide deeper insights into their individual contributions.
The proposed framework has significant implications for real-world applications in personalized keyword spotting and speaker verification systems, particularly in environments with short utterance durations. The findings could influence the design of future speaker verification systems, making them more robust to variations in speech length and content.
Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, previous approaches concentrate on the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at https://github.com/hyebin-c/aspl.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology
The main contribution of this paper is the introduction of Audio-Side Prompt Learning (ASPL), which effectively enhances few-shot learning in audio-language models by integrating trainable prompts into the audio encoder. This innovative approach not only addresses a significant gap in current research but also demonstrates substantial improvements in classification performance across diverse audio tasks, marking a meaningful advancement in the field of audio machine learning.
The proposed methodology introduces a novel framework for Audio-Side Prompt Learning (ASPL) that integrates trainable prompts into the audio encoder, enhancing few-shot learning in audio-language models. The approach is methodologically sound, employing a multi-level modulation strategy that targets specific stages of the audio processing pipeline. This innovative perspective addresses a critical gap in existing research, which has predominantly focused on text-side prompting, thus expanding the scope of prompt learning in multimodal contexts.
The experiments are robust, covering 11 diverse datasets that span various audio classification tasks. The results demonstrate a consistent performance improvement over existing methods, with the ASPL framework showing significant gains in accuracy across different few-shot settings. The use of a standard evaluation protocol and comprehensive ablation studies further strengthens the validity of the findings.
The paper provides sufficient implementation details, including architecture specifications, training protocols, and hyperparameter settings, which facilitate reproducibility. The availability of the code on GitHub enhances this aspect, allowing other researchers to replicate the experiments and build upon the work.
While the paper presents a strong case for the efficacy of ASPL, it does not address potential limitations regarding the generalizability of the approach across all audio tasks, especially those with significantly different characteristics from the datasets used. Additionally, the reliance on a fixed number of parameters may limit adaptability in more complex scenarios.
The implications of this research are substantial, as it opens new avenues for improving audio classification systems, particularly in few-shot learning scenarios. The findings could influence future developments in multimodal AI, enhancing the performance of audio-language models in real-world applications such as sound recognition and interactive AI systems.
This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue this issue stems from instrument-specific expressive details in these conditions, which conflict with the target timbral properties. For example, imposing a violin's pitch-dominant vibrato contours onto a flute, which naturally exhibits loudness-dominant vibrato, impairs timbral fidelity. We propose AdaTT, a target-adaptive system that ensures high timbral fidelity across diverse timbre transfer scenarios within the ControlNet scheme. It selectively scales the frame-wise influence of pitch and loudness controls via text prompts to match the target instrument's identity. We also present a semi-automatic data construction pipeline to teach the model which expressive details to transform or preserve. Results show AdaTT achieves superior timbral fidelity and naturalness while retaining score-level content. Audio samples are available at https://dabinkim0.github.io/adatt/.
Primary: KAIST
All Institutions: KAIST
The main contribution of this paper is the development of AdaTT, a target-adaptive system for instrument timbre transfer that enhances timbral fidelity and naturalness while preserving score-level content. This work represents a significant advancement in the field of audio processing, addressing existing challenges in timbre transfer with innovative methodologies and robust experimental validation.
The proposed methodology, AdaTT, introduces a target-adaptive mechanism within the ControlNet framework, which innovatively scales the influence of pitch and loudness controls based on text prompts. This approach effectively addresses the challenge of timbral ambiguity in instrument timbre transfer, allowing for the preservation of expressive details while adapting to the target instrument's characteristics. The semi-automatic data construction pipeline is a notable contribution, as it alleviates the burden of manual annotation, enhancing the model's training efficiency. The integration of Control Scale Predictors (CSPs) and Text-Guided CSPs (TG-CSPs) for adaptive modulation is a significant advancement in the field.
The experiments are well-structured, utilizing a comprehensive dataset covering various instrument types and employing both objective and subjective evaluation metrics. The results demonstrate that AdaTT outperforms existing baselines in timbral fidelity and naturalness while maintaining score-level content preservation. The use of metrics like CLAP score and subjective ratings provides a robust framework for evaluating the model's performance, showcasing its effectiveness in real-world applications.
The paper provides sufficient details regarding the experimental setup, including the training process, data sources, and evaluation metrics. However, the absence of a publicly available code repository limits the reproducibility of the results. The authors should consider releasing their code and trained models to facilitate further research and validation of their findings.
The primary limitation identified is the model's restriction to monophonic audio, which may hinder its applicability in more complex musical contexts. Additionally, the method does not account for spatial characteristics such as reverberation, which could further enhance the realism of the generated audio. Future work should aim to extend these capabilities to polyphonic scenarios and integrate spatial audio features.
The implications of this research are significant for music production, composition, and arrangement, particularly for non-expert users who may lack instrumental proficiency. By enabling high-fidelity timbre transfer, AdaTT can democratize music creation, allowing a broader audience to engage with music technology. Furthermore, the techniques developed could inspire advancements in other areas of audio processing and generative models.
Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.
Primary: University of California Irvine
All Institutions: University of California Irvine, University of Illinois Chicago, Kennesaw State University
The main contribution of this paper is the introduction of AP-GRPO, a novel framework that significantly enhances the reconstruction of pathological speech by utilizing audible anchors and phonetic alignment. This work not only addresses a critical gap in the field of speech processing but also has the potential to impact clinical practices for individuals with speech impairments.
The methodology presented in this paper is innovative, particularly in its use of the Anchor-Gated Phonetic Group Relative Policy Optimization (AP-GRPO) framework. This approach leverages reliable audible anchors from pathological speech recordings to guide the reconstruction of distorted speech, which is a significant advancement over traditional methods that do not account for the non-uniform degradation of speech. The integration of phonetic alignment and anchor preservation rewards effectively addresses the challenges of reconstructing intelligible speech from severely degraded audio. The use of reinforcement learning to optimize the reconstruction process is well-justified and demonstrates a thoughtful approach to the problem.
The experimental evaluation is robust, with tests conducted across four different pathological speech conditions (ALS, cerebral palsy, dementia, and Parkinson's disease). The authors provide a comprehensive comparison against several baseline methods, demonstrating significant improvements in word error rate (WER) and character error rate (CER). The results are compelling, particularly the reduction of WER from 0.75 to 0.29 in severe cases, which indicates a meaningful enhancement in the quality of reconstructed speech. The use of various metrics (WER, CER, BLEU-4, and Content-F1) adds depth to the evaluation.
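Since the headline numbers here are word error rates, a minimal sketch of how WER is conventionally computed may help: it is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, so a value such as 0.29 corresponds to roughly 29 errors per 100 reference words. The function name and example sentences are assumptions; text normalization choices used by the paper are not specified here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Toy usage: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Character error rate follows the same recipe with characters in place of words. Note that WER can exceed 1.0 when the hypothesis contains many insertions.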
The paper provides detailed implementation details, including the data preprocessing steps, model architecture, and training parameters. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider sharing the code and datasets to facilitate independent verification of the findings.
While the paper presents a strong methodology and results, it acknowledges several limitations. The reliance on high-quality anchor extraction and the potential for errors in this preprocessing step could impact the overall performance. Additionally, the method does not directly generate restored speech audio, which is a critical aspect for practical applications. The authors also note that the performance on the Parkinson's dataset is more limited, suggesting that further optimization may be necessary.
The potential applications of this research are significant, particularly in clinical settings where effective communication is essential for patients with neurodegenerative disorders. By improving the intelligibility of reconstructed speech, this work could enhance the quality of life for affected individuals and support caregivers in understanding their communicative intent. The findings could also pave the way for future advancements in speech technology, particularly in the realm of assistive communication devices.
Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, causing detectors trained on vocoder data to generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English (and, to a limited extent, Chinese), leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, their scale makes them impractical for real-world application, particularly in low-resource and latency-constrained settings. To address this limitation, we propose GARUDA, a novel Small-ALM tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed Small-ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.
Primary: IIIT-Delhi
All Institutions: IIIT-Delhi, UPES, VBSPU
This paper introduces SEA-CF, the first large-scale benchmark for CF detection in SEA languages, and proposes GARUDA, a lightweight Small-ALM that outperforms existing models while addressing practical deployment challenges. The technical contributions are significant, with a strong focus on methodology and experimental validation, positioning this work as a valuable asset in the field of audio deepfake detection.
The methodology presented in the paper is robust, introducing the SEA-CF benchmark for CF detection in SEA languages, which is a significant advancement given the lack of resources in this area. The authors propose GARUDA, a lightweight Small-ALM that effectively combines dual-encoder architectures to capture semantic and prosodic features, which is innovative. The use of JS divergence as a loss function for aligning representations is a novel approach that enhances the model's performance. Overall, the methodology is well-structured and addresses practical deployment challenges.
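The Jensen-Shannon (JS) divergence mentioned above can be sketched as follows for two discrete probability distributions. How GARUDA turns its two encoder representations into distributions is not stated here, so the inputs below are hypothetical; only the divergence formula itself is standard.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric JS divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy usage with hypothetical distributions over three classes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
divergence = js_divergence(p, q)
```

Unlike KL divergence, JS divergence is symmetric and bounded above by log 2 (in nats), which makes it a well-behaved alignment penalty when either distribution may assign zero mass to some outcome.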
The experimental evaluation is comprehensive, utilizing both zero-shot and fine-tuned settings to assess the performance of GARUDA against existing SOTA models. The results demonstrate significant improvements over baselines, with rigorous statistical testing (McNemar’s test) validating the findings. The paper effectively highlights the necessity of in-domain training and the limitations of existing models when applied to SEA languages, underscoring the importance of the proposed SEA-CF benchmark.
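As background on McNemar's test, the sketch below computes the exact two-sided p-value for comparing two classifiers on the same test items. It uses only the discordant counts: `b`, the items the first model gets right and the second gets wrong, and `c`, the reverse. The function name and counts are assumptions, and the paper may have used the chi-squared approximation rather than this exact binomial form.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant counts b and c.

    Under the null hypothesis that both models have the same error rate,
    each discordant item is equally likely to favour either model, so
    min(b, c) follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy usage: the models disagree on 10 items, all in favour of one model.
p_value = mcnemar_exact(10, 0)
```

Items that both models classify correctly, or both misclassify, carry no information about which model is better, which is why they do not appear in the computation.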
The paper provides sufficient details regarding the dataset construction, model architecture, and training procedures, which enhances reproducibility. The authors mention the use of publicly available datasets and provide a project URL for accessing the SEA-CF benchmark, which is crucial for other researchers looking to replicate or build upon this work.
While the paper makes significant contributions, it acknowledges limitations such as the incomplete coverage of all SEA languages and the current restriction of evaluations to available benchmarks. Future work is needed to expand the dataset and improve generalization across diverse generators.
The work has substantial implications for enhancing security against audio deepfakes in low-resource language contexts, addressing a critical gap in the current landscape of speech technology. By focusing on SEA languages, the research promotes inclusivity and provides tools that can be vital for protecting vulnerable communities against audio fraud.
This paper proposes a geometrically constrained decentralized independent vector analysis (GC-Dec-IVA) method for distributed microphone arrays. The recently proposed Dec-IVA method enables source separation by exchanging only power-related statistics to exploit cross-array information. However, this initial attempt often provides negligible improvement over applying IVA locally at each array, mainly due to the potential permutation inconsistency among arrays and the strong cross-array dependency implied by its source model. To address these limitations, we incorporate direction-of-arrival (DOA) information to derive GC-Dec-IVA, which mitigates permutation mismatch across arrays and enhances source alignment. Furthermore, a new source model is introduced to weaken cross-array dependency, improving robustness against permutation inconsistency in noisy environments. Experiments show that the proposed method improves both separation performance and cross-array permutation consistency.
Primary: Waseda University
All Institutions: Waseda University, Nanjing University, Northwestern Polytechnical University, School of Electronic Information, Wuhan University, School of Intelligence Science and Technology
The main contribution of this paper is the introduction of the GC-Dec-IVA method, which effectively utilizes DOA information to improve source separation in decentralized microphone arrays. This work represents a substantial advancement in the field of audio signal processing, addressing critical challenges in blind source separation and enhancing the robustness of decentralized methods in practical applications.
The paper introduces a novel geometrically constrained decentralized independent vector analysis (GC-Dec-IVA) method that effectively incorporates direction-of-arrival (DOA) information to address the limitations of the existing Dec-IVA method. The proposed approach enhances source alignment and mitigates permutation inconsistencies across distributed microphone arrays. The methodology is well-structured, leveraging a new source model that reduces cross-array dependency, which is a significant improvement over previous models. The use of a maximum a posteriori (MAP) principle to derive the cost function is a solid theoretical foundation, and the iterative optimization algorithm appears to be robust.
The experiments conducted are comprehensive, utilizing simulated reverberant environments with varying numbers of microphone arrays and noise conditions. The performance metrics, including signal-to-distortion ratio improvement (SDRi) and signal-to-interference ratio improvement (SIRi), are appropriate for evaluating the effectiveness of the proposed methods. The results demonstrate a clear improvement in separation performance and permutation consistency, particularly in noisy environments, validating the proposed approach's effectiveness.
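As a rough illustration of how an improvement metric such as SDRi is read, the sketch below uses a simplified energy-ratio SDR and reports the gain of the separated estimate over the unprocessed mixture. This is an assumption made for illustration: the paper's exact SDR/SIR definitions (for example, projection-based BSS-Eval variants) are not stated here, and the signals and function names are hypothetical toy values.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    Assumption: plain energy-ratio SDR, not a projection-based variant.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical): a target tone plus an interfering tone.
reference = [math.sin(0.1 * n) for n in range(200)]
interference = [0.5 * math.sin(0.37 * n + 1.0) for n in range(200)]
mixture = [s + v for s, v in zip(reference, interference)]
# A pretend separator that attenuates the interference by a factor of 10.
estimate = [s + 0.1 * v for s, v in zip(reference, interference)]

sdri = sdr_improvement_db(reference, mixture, estimate)
print(f"SDRi = {sdri:.2f} dB")
```

Attenuating the interference amplitude by a factor of 10 lowers the error energy by a factor of 100, which corresponds to a 20 dB improvement under this simplified definition.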
The paper provides sufficient detail regarding the experimental setup, including the generation of speech mixtures, noise conditions, and the parameters used for the algorithms. However, the lack of a publicly accessible code repository or demo URL limits the reproducibility of the results. Future work could benefit from sharing the implementation details to facilitate further research and validation by the community.
One limitation of the proposed method is its reliance on accurate DOA information, which may not always be available in practical scenarios. Additionally, while the results are promising, they are based on simulated environments, and real-world performance may vary due to unmodeled factors such as varying noise types and room acoustics.
The proposed GC-Dec-IVA method has significant potential applications in various fields, including teleconferencing, meeting transcription, and smart environments where multiple microphone arrays are deployed. By improving source separation in noisy conditions, the method could enhance communication clarity and effectiveness in real-world applications, thereby contributing to advancements in audio processing technologies.
Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.
Primary: South China University of Technology
All Institutions: South China University of Technology, Tongji University, Foshan University
The paper introduces NV-MOS, an expert-rated dataset and modeling framework for perceptual quality assessment of non-verbal vocalizations in speech. This work significantly advances the understanding and evaluation of NVs, offering a dedicated approach that outperforms existing general-purpose models and highlighting the need for specialized tools in this domain.
The methodology presented in this paper is robust and well-structured. The authors construct the NV-MOS dataset, which is a significant contribution to the field, as it provides a dedicated resource for evaluating non-verbal vocalizations in speech. The proposed NVMOS model employs a novel local NV-event focusing module that effectively utilizes cross-attention mechanisms to assess the perceptual quality of NVs in a way that traditional models do not. This approach is innovative as it combines audio signal processing with textual context, allowing for a more nuanced evaluation of NVs.
The experimental evaluation is thorough, utilizing a well-defined dataset with expert ratings to validate the performance of the NVMOS model. The results demonstrate that NVMOS achieves a high level of agreement with human expert ratings, outperforming general-purpose multimodal models. The correlation metrics reported (Pearson, Spearman, and Kendall) provide a strong basis for assessing the model's effectiveness. Additionally, the ablation study effectively highlights the importance of the tag-centered query in improving model performance.
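The three agreement metrics named above can be sketched in plain Python as follows. The scores are made-up toy values, not data from the paper, and the function names are illustrative; Kendall's coefficient is computed in its tau-b form, which is an assumption since the review does not say which variant was reported.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: pairwise ordering agreement with a tie correction."""
    concordant = discordant = untied_x = untied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        untied_x += dx != 0
        untied_y += dy != 0
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    return (concordant - discordant) / math.sqrt(untied_x * untied_y)


# Hypothetical expert MOS values and model predictions for five clips.
expert_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.1, 4.8]

plcc = pearson(expert_mos, predicted)
srcc = spearman(expert_mos, predicted)
ktau = kendall_tau_b(expert_mos, predicted)
print(f"Pearson={plcc:.3f} Spearman={srcc:.3f} Kendall={ktau:.3f}")
```

Pearson measures linear agreement in score values, while Spearman and Kendall only look at ordering, so a predictor that ranks clips correctly but compresses the scale can still score well on the latter two.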
The paper provides sufficient details regarding the experimental setup, including data splitting, model architecture, and training procedures. However, the lack of a publicly available project URL or demo limits the reproducibility of the results. Future work should consider releasing the code and dataset to facilitate further research in this area.
One limitation of the study is the reliance on expert ratings, which, while valuable, may introduce subjectivity into the evaluation process. Additionally, the dataset may not cover all possible NV scenarios, potentially limiting the generalizability of the findings. The performance of the NVMOS model may also vary with different NV types or in more complex acoustic environments.
The implications of this research are significant, particularly in the fields of speech synthesis and human-computer interaction. By improving the quality assessment of non-verbal vocalizations, this work can enhance the expressiveness and naturalness of generated speech in applications such as virtual assistants, gaming, and emotional AI. The findings also highlight the limitations of current multimodal models, paving the way for more specialized approaches in audio quality assessment.
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.
Primary: Nanjing University of Posts and Telecommunications
All Institutions: Nanjing University of Posts and Telecommunications
The main contribution of this paper is the introduction of the MAF framework, which enhances multimodal sentiment analysis through adaptive retrieval and weighted fusion of demonstrations, improving prediction stability and accuracy. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the potential for advancing sentiment analysis capabilities using multimodal large language models.
The proposed MAF framework introduces a novel approach to multimodal sentiment analysis by dynamically retrieving relevant demonstrations and adapting fusion weights based on input queries. The integration of facial, scene, and text features, along with a lightweight coefficient generator, enhances the adaptability of the model to various sentiment expressions. The majority voting mechanism further stabilizes predictions, addressing common issues in sentiment analysis using MLLMs. However, while the methodology is innovative, it builds upon existing concepts in retrieval-augmented generation and multimodal integration, which may limit its perceived novelty.
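The retrieval and voting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-modality similarity scores, the weight dictionaries (standing in for the output of the coefficient generation network), and all function names are hypothetical.

```python
from collections import Counter


def fuse_similarities(modality_scores, weights):
    """Weighted sum of per-modality similarity scores for one candidate."""
    return sum(weights[m] * modality_scores[m] for m in weights)


def retrieve_top_k(candidates, weights, k):
    """Rank candidate demonstrations by fused similarity; keep the best k."""
    ranked = sorted(
        candidates,
        key=lambda c: fuse_similarities(c["sim"], weights),
        reverse=True,
    )
    return ranked[:k]


def majority_vote(labels):
    """Most frequent label among several sampled model outputs."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical demonstration pool with precomputed similarities to a query.
pool = [
    {"id": "d1", "sim": {"face": 0.9, "scene": 0.2, "text": 0.4}},
    {"id": "d2", "sim": {"face": 0.3, "scene": 0.8, "text": 0.9}},
    {"id": "d3", "sim": {"face": 0.5, "scene": 0.5, "text": 0.5}},
    {"id": "d4", "sim": {"face": 0.1, "scene": 0.9, "text": 0.2}},
]

# Assumed fusion weights for two different queries; in MAF these would come
# from the trained coefficient generator rather than being fixed by hand.
face_heavy = {"face": 0.6, "scene": 0.1, "text": 0.3}
text_heavy = {"face": 0.1, "scene": 0.1, "text": 0.8}

top_face = [c["id"] for c in retrieve_top_k(pool, face_heavy, k=2)]
top_text = [c["id"] for c in retrieve_top_k(pool, text_heavy, k=2)]
final_label = majority_vote(["positive", "negative", "positive"])
print(top_face, top_text, final_label)
```

The two weight settings select different demonstrations from the same pool, which is the practical effect of conditioning the fusion weights on the query instead of fixing them globally.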
The experiments conducted on three public benchmark datasets (CMU-MOSEI, CH-SIMS v2.0, and MELD) demonstrate the effectiveness of the MAF framework. The results indicate consistent performance improvements over baseline models, showcasing the robustness of the proposed methods. The ablation studies provide valuable insights into the contributions of each component, confirming the importance of retrieval, adaptive weighting, and voting mechanisms. However, the lack of a comprehensive comparison with state-of-the-art models in terms of computational efficiency and scalability could be seen as a limitation.
The paper provides detailed implementation details, including hyperparameter settings and the architecture of the coefficient generator. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider releasing the code to facilitate further research and validation of the proposed methods.
The primary limitations include the reliance on a fixed demonstration corpus, which may not generalize well to unseen data or diverse contexts. Additionally, the performance may be sensitive to the choice of multimodal features and the number of retrieved demonstrations, which could affect the model's adaptability. The lack of a comprehensive evaluation of the computational efficiency of the MAF framework compared to existing methods is also a concern.
The MAF framework has the potential to significantly enhance sentiment analysis applications in various domains, including social media monitoring, customer feedback analysis, and emotional recognition in human-computer interactions. By improving the robustness and accuracy of sentiment predictions, this research could lead to more effective tools for understanding human emotions in multimodal contexts.
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity. This violates the independence assumption held by disentanglement-based approaches, causing both leakage of private information and loss of useful information for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Code is available at https://github.com/cactuswiththoughts/DDPO-VC and a demo at https://cactuswiththoughts.github.io/SpeakerDeID-Demo/.
Primary: MIT CSAIL
All Institutions: MIT CSAIL, Boston University
The main contribution of this paper is the introduction of DDPO-VC, a novel framework for speaker de-identification that balances privacy and utility through reinforcement learning and diffusion models. This work represents a significant advancement in the field, addressing critical challenges in the intersection of privacy and cognitive utility in speech processing.
The proposed DDPO-VC framework effectively integrates reinforcement learning with diffusion models to address the dual challenge of privacy and utility in speaker de-identification. The methodology is well-structured, leveraging a conditional diffusion model and a novel reward mechanism that utilizes both privacy and utility teachers. This innovative approach allows for a more nuanced optimization of the privacy-utility tradeoff, which is critical in sensitive applications such as healthcare. The use of reinforcement learning to navigate complex correlations between variables is a significant advancement over traditional disentanglement methods.
The experiments are robust, utilizing two dementia speech benchmarks that are relevant and challenging. The results demonstrate clear superiority over existing methods in both privacy preservation and cognitive utility, with well-defined metrics such as AUC and EER. The comprehensive evaluation across multiple settings (zero-shot and fine-tuned) adds credibility to the findings. However, further details on the datasets and the specific configurations used in experiments would enhance the clarity of the evaluation.
The paper provides a GitHub repository and demo link, which is a positive aspect for reproducibility. However, the implementation details could be more explicit, particularly regarding hyperparameters and training procedures, to ensure that other researchers can replicate the results accurately.
One limitation noted is the potential for reward hacking due to the fixed nature of the privacy teacher. Additionally, the reliance on pretrained models for the privacy and utility teachers may limit the generalizability of the approach to other domains. The paper also acknowledges the need for more diverse evaluation metrics beyond naturalness and speaker similarity, indicating room for improvement in the evaluation framework.
The implications of this research are significant, particularly in fields where privacy is paramount, such as healthcare. By improving speaker de-identification methods, the framework can help protect sensitive information while still allowing for the utility of speech data in applications like dementia diagnosis and monitoring. The potential for broader applications in other audio domains and utility variables further enhances its relevance.
We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new prompt, but this inversion-style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long-range musical structure. Motivated by recent inversion-free flow editing in computer vision, we develop an audio-specific direct source-to-target ordinary differential equation for one-dimensional Stable Audio 3 latents: at each flow step, we compare the target- and source-conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound-effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target-text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.
Primary: Nankai University
All Institutions: Nankai University
The main contribution of this paper is the introduction of AudEdit, a zero-shot text-guided audio editor that employs an inversion-free direct ODE for audio editing, significantly improving the trade-off between prompt adherence and source preservation. This work represents a meaningful advancement in the field of audio processing, addressing critical challenges in audio editing while leveraging state-of-the-art generative models.
The paper presents a novel approach to text-guided audio editing through an inversion-free method using pretrained rectified-flow audio models. The authors develop a direct source-to-target ordinary differential equation that allows for effective editing without the need for training or optimization. This methodology is innovative as it circumvents the common issues associated with inversion methods, particularly in preserving audio characteristics while adhering to new prompts. The integration of stochastic source marginals to refine the editing process is a noteworthy aspect that enhances the robustness of the approach.
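The update described above (compare target- and source-conditioned velocities at a shared noisy source point, then move the edited latent by their difference) can be sketched on a toy problem. Everything here is an assumption made for illustration: the time convention (t = 1 is noise, t = 0 is data), the Euler discretization, the way the target-side latent is constructed, and the closed-form `toy_velocity` standing in for the pretrained rectified-flow network. It follows the general shape of inversion-free flow editing and is not AudEdit's actual code.

```python
import random


def toy_velocity(z, t, class_mean):
    """Toy stand-in for a conditional rectified-flow velocity network.

    If the data for a condition sit at class_mean and
    z_t = (1 - t) * x + t * noise, the ideal velocity (noise - x)
    equals (z_t - class_mean) / t.
    """
    return [(zi - mi) / t for zi, mi in zip(z, class_mean)]


def inversion_free_edit(x_src, src_mean, tar_mean, num_steps=50, seed=0):
    """Integrate a direct source-to-target ODE without inverting x_src."""
    rng = random.Random(seed)
    z_edit = list(x_src)  # the edited latent starts at the source latent
    times = [i / num_steps for i in range(num_steps, -1, -1)]  # 1.0 -> 0.0
    for t_cur, t_next in zip(times[:-1], times[1:]):
        # Sample a noisy source latent from the forward marginal at t_cur.
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        z_src = [(1 - t_cur) * x + t_cur * n for x, n in zip(x_src, noise)]
        # Reuse the same noise for the target side, offset by the edit so far.
        z_tar = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_tar = toy_velocity(z_tar, t_cur, tar_mean)
        v_src = toy_velocity(z_src, t_cur, src_mean)
        # Euler step along the velocity difference (dt is negative).
        dt = t_next - t_cur
        z_edit = [ze + dt * (vt - vs) for ze, vt, vs in zip(z_edit, v_tar, v_src)]
    return z_edit


# Hypothetical 3-dimensional latent and condition means.
x_src = [1.0, -2.0, 0.5]
src_mean = [0.0, 0.0, 0.0]
tar_mean = [3.0, 1.0, -1.0]

edited = inversion_free_edit(x_src, src_mean, tar_mean)
print([round(v, 6) for v in edited])
```

In this toy setting the shared noise cancels in the velocity difference, so the edited latent ends at the source shifted by the gap between the two condition means: the offset of the source from its own condition mean is carried over, which is the preservation behavior the method aims for.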
The experiments are comprehensive, utilizing well-defined datasets for sound effects and music derived from established sources like FSD50K and the Song Describer Dataset. The evaluation metrics are robust, including both objective measures (like CLAP similarity and FAD) and subjective assessments (mean opinion scores). The results demonstrate clear improvements over baseline methods, indicating the effectiveness of the proposed method in achieving a balance between prompt adherence and source preservation.
The paper provides detailed implementation settings, including the configuration of the Stable Audio 3 model and the parameters used in experiments. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate validation by the research community.
The method is primarily designed for controlled edits and may struggle with broader semantic rewrites that require significant changes to the audio content. The authors acknowledge that the approach inherits limitations from the Stable Audio 3 backbone, including its reliance on specific conditioning and the lack of explicit temporal controls. Additionally, the method may introduce artifacts in cases where the target prompt demands extensive alterations.
This research has significant implications for audio editing in creative industries, such as music production and sound design, where maintaining the integrity of the original audio while allowing for meaningful edits is crucial. The inversion-free approach could streamline workflows for audio professionals, enabling more intuitive and efficient editing processes. Furthermore, the findings may inspire further research into generative audio models and their applications in various multimedia contexts.
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codecs and vocoders). To address this, we propose DuraMark, a robust information-level watermarking framework that embeds the watermark by editing syllable durations. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor to extract these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Institute of Forensic Science, Ministry of Public Security, The Hong Kong Polytechnic University
The main contribution of this paper is the introduction of DuraMark, a novel generative watermarking framework that embeds watermarks into synthesized speech by editing syllable durations, significantly improving robustness against generative attacks while preserving speech quality. This work represents a meaningful advancement in the field of audio processing and watermarking, addressing critical concerns related to deepfake technologies and the integrity of synthesized speech.
The proposed DuraMark framework introduces a novel approach to watermarking in LLM-based TTS systems by embedding watermarks at the information level through syllable duration editing. This method is innovative as it leverages a duration-controllable TTS model and a duration extractor, which allows for precise control over the watermarking process while maintaining the naturalness of the synthesized speech. The integration of these components is well-structured, and the methodology is clearly articulated, allowing for a thorough understanding of the process.
The experiments conducted are robust, utilizing a substantial dataset and comparing DuraMark against established signal-level watermarking methods. The evaluation metrics include True Positive Rate (TPR) under various attack scenarios, which is a relevant measure of robustness. The results demonstrate DuraMark's superior performance, particularly against generative attacks, which is a critical aspect of the paper's claims. The use of both objective and subjective metrics to assess speech naturalness further strengthens the experimental evaluation.
The paper provides sufficient detail regarding the experimental setup, including the datasets used and the training parameters. However, the absence of a public code repository limits reproducibility. While the methodology is clearly described, access to the code would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a specific language (Mandarin Chinese) for the experiments, which may affect the generalizability of the findings to other languages or dialects. Additionally, while the paper demonstrates robustness against various attacks, it does not explore the performance of DuraMark under more extreme or novel attack scenarios that may arise in real-world applications.
The implications of this research are significant, particularly in the context of combating deepfake technologies and ensuring the integrity of synthesized speech. The DuraMark framework could be applied in various fields, including media, security, and digital forensics, where the authenticity of audio content is crucial. The potential for this technology to enhance trust in AI-generated content is noteworthy.
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient style learning and thus limiting speaker similarity in synthesized speech. To this end, we investigate the prosody learning conditioned on the synthesized speech, and propose to predict the prosody of the current syllable based on previously predicted speech. Experimental results obtained on three datasets demonstrated the efficacy of the proposed dynamic prosody prediction method in enhancing the prosody learning capability, thereby improving the speaker similarity of the generated speech. Audio samples are available at https://muzw.github.io/dynapros/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFLYTEK
The main contribution of this paper is the introduction of a dynamic prosody prediction method that enhances speaker similarity in personalized TTS systems. This innovative approach, supported by comprehensive experimental validation, addresses key limitations in existing TTS technologies and has the potential to significantly impact the field of speech synthesis.
The proposed dynamic prosody prediction method represents a significant advancement in TTS technology by allowing for syllable-level prosody prediction based on previously generated speech. This approach addresses the limitations of existing methods that typically rely on static prosody modeling. The integration of prosody prediction into the speech generation process is well-justified and demonstrates a clear understanding of the challenges in personalized TTS systems. The methodology is sound, with a clear architecture and loss function defined, although the paper could benefit from more detailed explanations of the equations presented.
The experiments are comprehensive, utilizing three diverse datasets that cover a range of emotional and stylistic variations. The results are presented clearly, showing improvements in speaker similarity and prosody modeling capabilities. The use of both objective metrics (e.g., CER, emotion similarity) and subjective evaluations (e.g., MOS, preference tests) adds robustness to the findings. However, the paper could enhance its credibility by providing more detailed statistical analyses of the results, such as confidence intervals or significance testing.
The paper provides sufficient details regarding the experimental setup, including the datasets used, model architectures, and training procedures. The availability of the CosyVoice implementation and audio samples supports reproducibility. However, the lack of specific hyperparameter settings and training configurations for the proposed model could hinder complete reproducibility.
One limitation of the study is its focus on Mandarin Chinese, which may restrict the applicability of the findings to other languages or dialects. Additionally, while the proposed method shows promise in improving speaker similarity, the paper does not address potential challenges in real-world applications, such as the computational efficiency of the model during inference.
The proposed method has significant implications for the development of personalized TTS systems, particularly in applications such as virtual assistants, audiobooks, and entertainment. By improving speaker similarity, the approach could enhance user experience and engagement in various audio-related applications. Furthermore, the findings may inspire further research into dynamic prosody modeling in other languages and contexts.
Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/
Primary: Tsinghua University
All Institutions: Tsinghua University, Alibaba Group, Monash University, Renmin University of China, Fudan University
FreeSonic presents a training-free framework for precise audio editing that leverages advanced attention mechanisms and noise injection techniques. The paper's contributions are significant, offering a novel approach to addressing longstanding challenges in the field of audio editing while demonstrating strong experimental validation and potential for broader applications.
The methodology presented in FreeSonic is innovative, combining a training-free approach with advanced techniques such as Rectified Flow-based models and joint text-audio attention maps. The introduction of scheduled attention decoupling and task-oriented noise injection is particularly noteworthy as it allows for precise audio editing while maintaining background integrity. The paper effectively addresses the challenges of temporal consistency and background preservation in audio editing, which are critical for high-fidelity audio applications.
The experimental evaluation is robust, utilizing both quantitative metrics (FAD, KL, IS, FD, CLAP) and subjective assessments (Mean Opinion Score) to validate the effectiveness of FreeSonic. The results demonstrate superior performance compared to existing training-free and training-based methods across various editing tasks. The ablation studies further reinforce the significance of each component in the proposed framework, showcasing a thorough understanding of the model's capabilities and limitations.
The paper provides a clear description of the experimental setup, including datasets and evaluation metrics, which supports reproducibility. However, the specifics of the implementation details, such as hyperparameter settings and the exact architecture of the model, could be more explicitly detailed to enhance reproducibility further.
One limitation of the study is the reliance on the performance of a single model architecture (TangoFlux) without exploring the potential of other architectures or hybrid approaches. Additionally, while the training-free aspect is a significant advantage, it may limit the model's adaptability to more complex audio editing scenarios that could benefit from fine-tuning.
FreeSonic has the potential to significantly impact the field of audio editing and generation, particularly in applications requiring high fidelity and precision, such as music production, film editing, and interactive media. The training-free nature of the approach could democratize access to advanced audio editing tools, allowing non-experts to achieve professional-quality results.
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic-explainability-by-design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence-boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory category breakdown of the final verdict.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland
This paper presents a novel phoneme-guided cross-attention framework for speech deepfake detection, significantly enhancing interpretability and performance. The methodology effectively integrates phonetic structures into the detection process, providing a clear basis for understanding model decisions and contributing valuable insights to the field of audio processing and explainable AI.
The proposed methodology introduces a phoneme-guided cross-attention framework that significantly enhances the interpretability of speech deepfake detection systems. By leveraging phonetic posteriorgrams (PPGs) as a structural interface, the framework allows for a detailed analysis of the contribution of each phonetic class to the detection decision. This contrasts with traditional models that produce a single score without insight into the phonetic structure. The probabilistic factorization of the spoofing posterior into per-phone contributions is a novel approach that provides a clear, interpretable mechanism for understanding model behavior, which is a significant advancement in the field of explainable AI in speech processing.
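The factorization $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$ can be illustrated with a small numerical sketch. This is not the authors' implementation: the function name is hypothetical, and the prevalence weights $w_i$ are assumed here to be the time-average of the frame-level posteriorgram, whereas the paper obtains them through softmax-normalized pooling inside a cross-attention block.

```python
import numpy as np

def spoof_posterior(ppg, per_phone_spoof_prob):
    """Combine per-phone spoofing probabilities into one utterance-level score.

    ppg: array of shape (T, M), frame-level phonetic posteriorgram
         (each row is a distribution over M phonetic classes).
    per_phone_spoof_prob: array of shape (M,), P(spoofed | X, Z = z_i).
    Returns the utterance-level posterior and the per-phone contributions.
    """
    ppg = np.asarray(ppg, dtype=float)
    p = np.asarray(per_phone_spoof_prob, dtype=float)
    # Assumption: prevalence w_i is the average posterior mass of class i over time.
    w = ppg.mean(axis=0)
    w = w / w.sum()  # renormalize so the weights sum to 1
    contributions = w * p  # w_i * P(spoofed | X, Z = z_i)
    return float(contributions.sum()), contributions

if __name__ == "__main__":
    # Toy utterance: 4 frames, 3 phonetic classes (e.g. stop, vowel, fricative).
    ppg = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
    score, parts = spoof_posterior(ppg, [0.9, 0.2, 0.6])
    print(score, parts)
```

The per-phone contributions are what make the verdict inspectable: each term shows how much a phonetic class added to the final score, which combines how often it occurs with how suspicious it looks.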
The experimental evaluation is robust, utilizing three datasets of varying complexity, including a controlled corpus and standard benchmarks like ASVspoof 2019 LA. The results demonstrate competitive performance while also providing insights into the discriminative power of different phonetic categories. The targeted phoneme-group ablation study further validates the importance of articulatory categories, confirming the model's ability to isolate and rank the contributions of different phonetic classes effectively.
The paper lacks explicit details regarding the implementation and availability of the code or models, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a publicly accessible repository or demo limits the ability for other researchers to validate and build upon the findings.
One limitation is the reliance on the quality of the phonetic posteriorgrams, which may introduce noise or inaccuracies if the phoneme extraction process is not robust. Additionally, while the model shows promise in structured interpretability, it may still struggle with complex, real-world scenarios where the phonetic structure is less clear. The paper does not address potential biases in the datasets used for training and evaluation.
The implications of this work are significant, particularly in the context of forensic voice analysis and anti-spoofing measures in security systems. By enhancing the interpretability of deepfake detection, the framework could facilitate more reliable applications in legal and security settings, where understanding the basis of decisions is crucial. Furthermore, the integration of phonetic structures into detection systems may inspire new research avenues in both speech synthesis and recognition.
The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing generalization through techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism that uses a teacher-student framework to enhance the model's ability to perform and generalize effectively across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is efficient in domain adaptation, improving the model's AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets while using only a small portion of each to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.
Primary: Sharif University of Technology
All Institutions: Sharif University of Technology
The paper presents a novel deepfake detection model that effectively integrates audio-visual modalities through a teacher-student framework, demonstrating strong performance across multiple datasets and highlighting its potential for real-world applications. The comprehensive methodology and experimental validation contribute meaningfully to the ongoing efforts in combating deepfake technologies.
The proposed EAV-DFD model employs a robust teacher-student framework for domain adaptation, integrating audio, visual, and audio-visual modalities through an ensemble architecture. The methodology is well-structured, with clear delineation of the training processes for both teacher and student models, and the use of specialized loss functions enhances the model's adaptability to unseen domains. The incorporation of unimodal networks alongside the audio-visual network allows for effective handling of scenarios where one modality may be missing, which is a significant advantage in real-world applications.
The experiments are comprehensive, utilizing multiple datasets (FakeAVCeleb, DFDC, Deepfake_TIMIT, PolyGlotFake) to evaluate the model's performance across different domains. The results demonstrate significant improvements in AUC metrics, particularly in cross-domain generalization, which underscores the effectiveness of the proposed approach. The ablation studies provide valuable insights into the contributions of various components of the model, further validating the methodology.
The paper provides a GitHub repository link, which is crucial for reproducibility. However, detailed implementation specifics, such as hyperparameter settings and training configurations, could be more explicitly stated to facilitate easier replication of results by other researchers.
The model's performance may degrade under challenging conditions such as poor lighting or multi-speaker scenarios, indicating that further refinements are needed to enhance robustness. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other types of deepfake detection tasks.
This research has significant implications for the field of deepfake detection, particularly in enhancing the reliability of media content in sensitive contexts such as politics and security. The model's ability to adapt to new deepfake generation methods without catastrophic forgetting is particularly relevant as generative AI technologies continue to evolve.
With the rapid deployment of speech generation systems in open environments, providing verifiable source attribution and copyright accountability for audio content has become critical. A gap in current research is the lack of a unified benchmark that systematically compares different watermark injection methods under realistic distribution shifts. To address this, we build VoxWatermark by applying 10 watermarking methods (4 neural and 6 traditional) with unified injection and annotation on multilingual, multi-source corpora, and introducing no-box, black-box, and white-box perturbations to simulate real recording and transmission conditions. Based on this benchmark, we propose AudioWMD as a robust baseline detector for large-scale, multi-method, cross-distribution settings. Results show that injection-method diversity and distribution shifts affect detection stability, while validating the effectiveness and scalability of AudioWMD. Dataset and code are publicly available.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, University of Tehran
The paper presents VoxWatermark, a large-scale benchmark for audio watermark detection, and proposes the AudioWMD framework, significantly advancing the state of research in audio watermarking and detection methodologies. The comprehensive evaluation of various watermarking methods under realistic perturbations provides valuable insights into the robustness of detection systems, paving the way for future advancements in the field.
The paper introduces a comprehensive methodology for audio watermark detection by constructing the VoxWatermark benchmark, which systematically evaluates various watermarking methods under different perturbation scenarios. The proposed AudioWMD framework employs a two-stage detection process that incorporates query-response stability analysis, enhancing robustness against adversarial attacks. The methodology is well-structured, addressing a significant gap in the existing literature regarding the evaluation of watermark detection systems.
The experiments are rigorously designed, utilizing a large-scale dataset of over 126,000 hours of audio across multiple languages and perturbation types. The authors provide a detailed comparison of their AudioWMD detector against a baseline model (WMD), demonstrating superior performance across various out-of-domain (OOD) test sets. The results highlight the effectiveness of the proposed approach in maintaining detection stability under realistic conditions, although performance does degrade under certain adversarial attacks.
The authors have made their dataset and code publicly available, which is a significant step towards ensuring reproducibility. The detailed description of the experimental setup, including data partitioning and evaluation protocols, further supports the reproducibility of the results. However, the reliance on specific hyperparameters and configurations may still pose challenges for complete replication.
One limitation noted is the vulnerability of the AudioWMD detector to certain black-box attacks, indicating that while the model improves robustness, it is not entirely immune to sophisticated adversarial strategies. Additionally, the performance under no-box perturbations approaches chance levels, suggesting that further work is needed to enhance resilience against common audio processing distortions.
This research has significant implications for the fields of audio security and copyright protection, particularly as synthetic audio generation becomes more prevalent. The development of a robust watermark detection system is crucial for ensuring the integrity and authenticity of audio content in various applications, including media production and digital forensics.
A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.
Primary: Institute of Information Engineering, CAS
All Institutions: Institute of Information Engineering, CAS, School of Cyber Security, UCAS, The University of Western Australia, Beihang University
This paper presents a significant contribution to the understanding of pathway-dependent forgetting in multimodal models, introducing a novel experimental protocol and providing compelling evidence that the route of knowledge acquisition affects retention. The rigorous methodology and comprehensive experimental evaluation enhance its relevance and potential impact in the field of machine learning, particularly in audio and music processing.
The methodology introduces the Paired Pathway Controlled Protocol (PPCP), which is a well-structured experimental framework that rigorously controls for variables affecting knowledge retention in multimodal models. The three-phase design effectively isolates the pathway as the primary variable, ensuring that the results are attributable to the acquisition route rather than confounding factors. The methodology is sound and addresses significant blind spots in existing research on forgetting in multimodal models.
The experiments are comprehensive, involving multiple architecturally distinct audio-language models and a variety of controls to validate the findings. The results consistently show that text-pathway knowledge is forgotten more than audio-pathway knowledge, providing robust evidence for the proposed hypothesis. The statistical analyses are thorough, and the use of controlled experiments enhances the credibility of the findings.
The paper provides detailed descriptions of the experimental setup, including training configurations and evaluation metrics, which supports reproducibility. The availability of the project URL with code further facilitates replication of the study.
While the study is robust, it primarily focuses on audio-language models within the music domain, which may limit the generalizability of the findings to other multimodal systems or domains. The paper also acknowledges that further exploration is needed to determine if the observed effects hold across different architectural families.
The implications of this research extend to the design of multimodal systems, suggesting that forgetting interventions should be pathway-aware. This could influence future work in continual learning, model editing, and unlearning, as well as applications in music understanding and retrieval systems.
While LALMs show promise on audio question answering, they fail to focus on question-relevant audio segments and to provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a planning, tool execution, evidence integration, and answer verification workflow. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.
Primary: Tianjin University
All Institutions: Tianjin University
The main contribution of this paper is the introduction of EChO-Agent, a modular framework that enhances audio reasoning through a structured pipeline for tool execution, evidence integration, and answer verification. This comprehensive analysis highlights the technical contributions and significance of the methodology in addressing existing challenges in audio question answering, establishing a foundation for future advancements in the field.
The proposed EChO-Agent framework introduces a structured four-stage pipeline that effectively addresses the limitations of existing Large Audio Language Models (LALMs) in audio reasoning tasks. By integrating tool-augmented observation, evidence integration, reasoning, and verification, the methodology emphasizes a systematic approach to audio question answering. The use of specialized audio tools for observations and a structured evidence chain for reasoning is innovative, as it allows for a more nuanced understanding of audio context and improves the model's ability to produce verifiable outputs. The framework's design is well thought out, ensuring that each component contributes to the overall goal of enhancing audio reasoning.
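The staged workflow described above (planning, tool execution, evidence integration, answer verification) can be sketched as a small control loop. Everything in this sketch is hypothetical: the `Evidence` type, the `run_pipeline` function, the callable signatures, and the retry-on-failed-verification loop are illustrative assumptions, not the interfaces or control flow used by EChO-Agent.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Evidence:
    tool: str          # name of the (hypothetical) audio tool that produced it
    observation: str   # textual observation returned by the tool

def run_pipeline(
    question: str,
    audio: Any,
    planner: Callable[[str, List[Evidence]], List[str]],
    tools: Dict[str, Callable[[Any], str]],
    integrate: Callable[[str, List[Evidence]], str],
    verify: Callable[[str, List[Evidence], str], bool],
    max_rounds: int = 2,
) -> Tuple[str, List[Evidence]]:
    """Plan -> execute tools -> integrate evidence -> verify, with bounded retries."""
    evidence: List[Evidence] = []
    answer = ""
    for _ in range(max_rounds):
        # 1. Planning: choose which tools to call given what is known so far.
        for name in planner(question, evidence):
            # 2. Tool execution: each tool turns the audio into an observation.
            evidence.append(Evidence(name, tools[name](audio)))
        # 3. Evidence integration: derive an answer from the evidence chain.
        answer = integrate(question, evidence)
        # 4. Verification: stop once the answer is supported by the evidence.
        if verify(question, evidence, answer):
            break
    return answer, evidence
```

Keeping the evidence as an explicit list is what makes the reasoning checkable: the final answer can be traced back to the specific tool observations that support it.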
The experiments conducted on the MMAR benchmark provide a solid evaluation of the proposed method. The results demonstrate a significant improvement in accuracy and rubric scores compared to baseline models, indicating the effectiveness of the EChO-Agent framework. The ablation studies are particularly valuable, as they quantify the contributions of each component, reinforcing the importance of evidence integration and verification in the reasoning process. However, the paper could benefit from a more extensive comparison with a broader range of existing methods to contextualize its performance further.
The paper outlines a clear methodology and experimental setup, which aids in reproducibility. However, the lack of detailed implementation specifics, such as hyperparameters and the exact configurations used for the audio tools, limits the ability of other researchers to replicate the results fully. Providing access to code or supplementary materials would enhance reproducibility.
One limitation of the study is the reliance on specific audio tools, which may not generalize across all audio reasoning tasks. Additionally, while the framework shows promise, it has yet to be tested on a wider variety of datasets beyond the MMAR benchmark. The paper also does not address potential computational costs associated with the tool-augmented approach, which could impact scalability.
The EChO-Agent framework has the potential to significantly advance the field of audio reasoning and question answering, particularly in applications such as automated transcription, audio content analysis, and interactive audio systems. By improving the reliability and verifiability of audio reasoning, this work could lead to more robust AI systems capable of understanding complex audio environments, which is increasingly relevant in areas like virtual assistants, accessibility technologies, and multimedia content creation.
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology, Korea Advanced Institute of Science and Technology, Shanghai Jiaotong University
The main contribution of this paper is the introduction of AudioDER, a reasoning-oriented dataset designed to enhance the post-training of large audio-language models through a novel redundancy-aware construction pipeline. This work significantly advances the field by addressing the challenges of dataset redundancy and providing a comprehensive resource for improving audio reasoning capabilities in LALMs.
The proposed methodology is robust, focusing on a redundancy-aware data construction pipeline that effectively enhances the quality and diversity of training data for LALMs. The multi-stage process, which includes acoustic similarity-based deduplication, integration of existing annotations, and generation of CoT rationales, is well-structured and addresses key challenges in audio reasoning. The use of Qwen3-30B for rationale generation is particularly innovative, as it combines language understanding with audio processing to create a comprehensive dataset. The methodology is clearly articulated, with a logical flow from data collection to final dataset construction.
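The acoustic similarity-based deduplication step can be illustrated with a minimal sketch. It assumes each clip has already been mapped to a fixed-size embedding by some audio encoder (not specified here), and it uses greedy filtering with a cosine-similarity threshold. The function name, the greedy strategy, and the threshold value of 0.9 are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.9):
    """Greedy near-duplicate removal over clip embeddings.

    embeddings: array of shape (N, D), one embedding per audio clip.
    threshold: cosine similarity at or above which a clip counts as redundant.
    Returns the indices of the clips that are kept, in input order.
    """
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # L2-normalize, guarding against zero vectors
    kept = []
    for i in range(len(unit)):
        # Keep clip i only if it is not too similar to any clip already kept.
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return kept

if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.0, 2.0]]
    print(deduplicate(clips))  # the second and fourth clips are near-duplicates
```

The greedy loop compares each clip against every kept clip, which is quadratic in the worst case; at the scale of roughly 191k samples an approximate nearest-neighbor index would likely be a more practical choice.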
The experimental evaluation is thorough, demonstrating the effectiveness of the AudioDER dataset through extensive post-training experiments on multiple audio reasoning benchmarks. The results show consistent improvements in performance across various models, indicating that the dataset is not only well-constructed but also impactful in enhancing reasoning capabilities. The benchmarks chosen (MMAU-mini, MMSU, and MMAR) are relevant and challenging, providing a solid basis for evaluating the dataset's effectiveness.
The paper provides sufficient implementation details, including the architecture used (Qwen2-Audio-7B-Instruct), training parameters, and the experimental setup. However, the lack of a publicly available demo or interactive component limits the ease of reproducibility for external researchers. The open-source nature of the dataset is a positive aspect that encourages further exploration and validation by the community.
One limitation is the reliance on existing datasets for annotations, which may introduce biases inherent in those sources. Additionally, while the redundancy filtering process is beneficial, it may inadvertently remove samples that could contribute valuable diversity. The paper does not address potential scalability issues related to the dataset size or the computational resources required for post-training on larger models.
The AudioDER dataset has significant potential for advancing research in audio reasoning and LALMs. By providing a high-quality, structured dataset, it can facilitate the development of more capable audio understanding systems, which could have applications in various fields such as accessibility, education, and entertainment. The emphasis on reducing redundancy also highlights a critical area for improvement in dataset construction practices across machine learning.
Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and reinforcement learning (RL) based on-the-fly SED rectification approach for post-training SER systems on automatically annotated SED labels. Experiments on IEMOCAP and MELD suggest that explainable SER systems incorporating the proposed confidence score and RL-based SED rectification approach consistently outperform baselines without data selection or SED rectification. The best performing system, which integrates both components, surpasses the baseline without data selection and SED rectification, achieving SER gains of 2.9% and 3.3% absolute (3.7% and 5.4% relative) on IEMOCAP and MELD benchmarks, respectively.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Chinese Academy of Sciences, National Research Council Canada, Tsinghua University
The main contribution of this paper is the introduction of a confidence score and reinforcement learning-based approach for rectifying speech emotion descriptors in SER systems, which significantly improves both performance and explainability. This work addresses critical challenges in SER, providing a robust framework that could influence future research and applications in emotion recognition and related fields.
The paper proposes a novel methodology that integrates a confidence score-based data selection method and a reinforcement learning (RL)-based SED rectification approach for improving SER systems. The confidence estimation model (CEM) is well-structured, employing a multi-layer perceptron (MLP) to evaluate the reliability of automatically annotated SED labels. The RL-based SED Controller adds a dynamic element to the training process, allowing for real-time adjustments to SED labels, which is a significant advancement over static label approaches. The methodology is clearly articulated and demonstrates a thoughtful approach to addressing the limitations of existing SER systems.
The experiments conducted on the IEMOCAP and MELD datasets are comprehensive, comparing the proposed system against various baselines and existing state-of-the-art models. The results show consistent improvements in SER performance, with statistically significant gains in accuracy. The use of t-SNE visualizations to illustrate the clustering of emotion categories adds depth to the analysis, demonstrating the effectiveness of the proposed methods in enhancing the explainability and trustworthiness of SER systems.
The paper provides sufficient detail regarding the experimental setup, including model architectures, training procedures, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. The authors should consider releasing their code and trained models to facilitate further research and validation of their findings.
One limitation is the reliance on automatically annotated SED labels, which may still introduce noise despite the proposed rectification methods. Additionally, while the paper demonstrates improvements on two datasets, the generalizability of the approach to other SER tasks or languages remains untested. The impact of varying the threshold for confidence score selection on performance is also not thoroughly explored.
The advancements in explainable and trustworthy SER systems have significant implications for human-computer interaction, particularly in applications such as virtual assistants, mental health monitoring, and customer service automation. By enhancing the interpretability of emotion recognition systems, this research could foster greater user trust and acceptance of AI technologies in sensitive areas.
We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.
Primary: Nankai University
All Institutions: Nankai University, Kuaishou Technology
FoleyGenEx presents a significant advancement in video-to-audio generation, effectively addressing key limitations of existing methods through innovative architectural and methodological contributions. The integration of multi-modal control, temporal alignment, and semantic precision positions this work as a valuable addition to the field of generative audio systems.
The methodology introduced in FoleyGenEx is robust, integrating a conditional injection mechanism and a multi-modal dynamic masking strategy to enhance temporal alignment and semantic precision in video-to-audio generation. The use of adverb-based data augmentation is particularly innovative, addressing the scarcity of nuanced training data and enabling fine-grained control over audio generation. The architecture builds upon the MMDiT framework, which is a solid choice for cross-modal tasks, and the paper clearly delineates how each component contributes to the overall performance improvements.
The experiments conducted on multiple datasets (AudioCaps, VGGSound, and Greatest Hits) are comprehensive and demonstrate the effectiveness of FoleyGenEx against existing methods. The metrics used for evaluation, including distribution matching and semantic alignment, are appropriate for the task. The inclusion of subjective evaluations (Good, Same, Bad study) adds depth to the assessment of the model's performance, particularly regarding the adverb augmentation.
The paper provides sufficient implementation details, including training configurations and dataset descriptions, which would allow for reproducibility. However, the lack of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on specific datasets, which may not generalize across all types of video-to-audio tasks. Additionally, the paper does not address potential biases in the training data or the implications of using large language models for data augmentation, which could affect the model's performance in real-world scenarios.
The advancements made in FoleyGenEx have significant implications for applications in multimedia content creation, enhancing user experience through synchronized audio generation. The ability to generate audio that is semantically aligned with video content can improve accessibility and engagement in various media formats, including film and gaming.
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.
Primary: Nanjing University
All Institutions: Nanjing University, CASIA
The paper presents OmniVideo-100K, a novel dataset and methodology for enhancing audio-visual reasoning in question-answering tasks. This work significantly contributes to the field by addressing existing limitations in audio-visual QA systems and demonstrating substantial performance improvements through innovative data generation techniques.
The proposed methodology introduces a two-stage automated data generation pipeline that enhances audio-visual QA by integrating structured scripting and clue-guided QA generation. This approach addresses significant limitations in existing methods by ensuring cross-segment referential consistency and promoting deep cross-modal reasoning, which is a notable advancement in the field.
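One way to picture the structured script is as a small data structure with a global entity list that every segment must refer back to. The sketch below is illustrative only: the class names, field names, and the two helper checks are assumptions, not the paper's schema or pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_s: float
    end_s: float
    description: str  # joint audio-visual description of the segment
    entity_ids: list = field(default_factory=list)  # entities referenced here


@dataclass
class VideoScript:
    summary: str
    entities: dict  # global entity list: id -> canonical description
    segments: list


def unknown_references(script):
    """Return (segment index, entity id) pairs missing from the global entity list."""
    return [(i, e) for i, seg in enumerate(script.segments)
            for e in seg.entity_ids if e not in script.entities]


def cross_segment_entities(script):
    """Entities mentioned in two or more segments: candidate anchors for cross-segment clues."""
    counts = {}
    for seg in script.segments:
        for e in set(seg.entity_ids):
            counts[e] = counts.get(e, 0) + 1
    return sorted(e for e, n in counts.items() if n >= 2)


# Made-up script for illustration.
script = VideoScript(
    summary="A chef prepares a dish while a timer runs.",
    entities={"E1": "chef in a white apron", "E2": "kitchen timer"},
    segments=[
        Segment(0.0, 10.0, "E1 chops vegetables; knife sounds are heard.", ["E1"]),
        Segment(10.0, 20.0, "E2 beeps; E1 turns toward it.", ["E1", "E2"]),
        Segment(20.0, 30.0, "E3 enters the room.", ["E3"]),
    ],
)
print(unknown_references(script))      # E3 is not in the global entity list
print(cross_segment_entities(script))  # E1 appears in two segments
```

In this reading, the consistency check guards referential consistency across segments, and entities that recur across segments are natural starting points for questions that span more than one clip.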
The experiments demonstrate robust performance improvements across various benchmarks, with fine-tuned models showing gains of up to 20.59% on the human-verified test set. The comprehensive evaluation on established benchmarks further validates the effectiveness of the proposed dataset and methodology.
The paper provides sufficient details regarding the experimental setup, including model configurations and dataset construction, which supports reproducibility. However, the absence of a public demo or interactive tool limits immediate accessibility for verification.
The paper does not address potential biases in the dataset construction process or the limitations of the automated pipeline in generating high-quality QA pairs. Additionally, the reliance on LLMs may introduce inherent biases or inaccuracies in the generated outputs.
The research has significant implications for advancing audio-visual understanding in AI, particularly in applications like video analysis, interactive media, and educational tools. The structured scripts and QA pairs can serve as valuable resources for further research and development in multimodal AI systems.
Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.
Primary: Sony AI
All Institutions: Sony AI
The paper presents a novel approach to spatio-temporal audio language modeling, introducing a dataset and methodologies that enhance the understanding of dynamic sound sources. The comprehensive evaluation and innovative contributions position this work as a significant advancement in the field of audio processing and machine learning.
The methodology presented in this paper is robust and innovative, combining a novel spatio-temporal audio QA dataset with a specialized audio encoder and an audio-language model. The use of first-order ambisonic (FOA) renderings to create a controlled benchmark for dynamic sound sources is a significant advancement in the field. The proposed ST-Audio Encoder effectively learns event semantics alongside source trajectories, and the integration with a large language model (LLM) for audio QA is a noteworthy contribution. The structured approach to generating QA pairs from the rendered audio scenes demonstrates a clear understanding of the complexities involved in audio-language reasoning.
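For readers unfamiliar with FOA, the sketch below shows how a mono sample can be panned into four first-order channels along a moving azimuth trajectory. It assumes the ACN channel order (W, Y, Z, X) with SN3D gains and a linear azimuth sweep; the paper's rendering pipeline, conventions, and motion model may differ, and all function names are illustrative.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg):
    """Pan one mono sample into FOA channels (W, Y, Z, X), assuming ACN order and SN3D gains."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    x = sample * math.cos(az) * math.cos(el)
    return (w, y, z, x)


def decode_azimuth(y, x):
    """Recover the azimuth in degrees from the horizontal channels of a single source."""
    return math.degrees(math.atan2(y, x))


def trajectory(start_az, end_az, n_frames):
    """Linearly interpolated azimuth per frame (a simple moving-source assumption)."""
    if n_frames == 1:
        return [start_az]
    step = (end_az - start_az) / (n_frames - 1)
    return [start_az + i * step for i in range(n_frames)]


# A source sweeping from the front (0 degrees) to the left (90 degrees) over four frames.
for az in trajectory(0.0, 90.0, 4):
    w, y, z, x = encode_foa(1.0, az, 0.0)
    print(round(az, 1), round(decode_azimuth(y, x), 1))
```

The per-frame azimuth list is the kind of dense trajectory metadata that can supervise a time-resolved encoder, whereas a clip-level label would collapse it to a single direction.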
The experiments are comprehensive, comparing the proposed models against various baselines, including static and dynamic encoders. The results clearly indicate that the ST-Audio Encoder outperforms existing models in terms of semantic recognition and spatial localization. The evaluation metrics are well-defined, and the experiments cover a range of scenarios, from single-source perception to complex two-source grounding and compositional reasoning. This thorough evaluation strengthens the paper's claims regarding the effectiveness of the proposed methods.
The paper provides sufficient details on the implementation, including the architecture of the ST-Audio Encoder and the training procedures for both the encoder and the LLM. However, no demo or project URL is provided, so external researchers cannot easily verify the results or reuse the proposed methods without access to the code or data.
The primary limitation noted in the paper is the reliance on controlled synthetic rendering, which may not fully capture the complexities of real-world acoustic environments. Additionally, the benchmark simplifies dynamic scenes, potentially overlooking important acoustic phenomena such as Doppler effects and non-monotonic motion. The authors also acknowledge the need for broader real-world evaluation and the potential biases inherited from the dataset used.
The advancements in spatio-temporal audio language modeling have significant implications for various applications, including robotics, augmented reality, and immersive audio experiences. By improving the understanding and reasoning capabilities of audio-language models, this research paves the way for more sophisticated human-computer interaction systems that can interpret and respond to dynamic auditory environments.
Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Jilin University, Hunan University, University of Electronic Science and Technology of China
This paper presents a critical evaluation of spatial audio models, revealing their limitations in phase encoding and suggesting directions for future research. The comprehensive methodology and significant findings contribute meaningfully to the field of machine learning in audio processing.
The methodology is robust, employing a psychoacoustic benchmark based on binaural masking level difference (BMLD) to evaluate various audio models. The authors effectively utilize a combination of frozen self-supervised learning models and analytical baselines to assess the encoding of interaural phase cues. The progressive physical ablation approach is particularly noteworthy as it isolates the detection mechanisms, providing clear insights into the models' reliance on spectro-temporal interference rather than genuine phase computation.
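The BMLD itself is a simple quantity: the detection threshold with the signal in phase at both ears (N0S0) minus the threshold with the signal inverted at one ear (N0Spi), both in dB. The sketch below builds the two stimulus types and shows why interaural subtraction, the core idea behind equalization-cancellation, only helps in the N0Spi case. The tone frequency, SNR, sample rate, and threshold values are illustrative assumptions, not the paper's settings.

```python
import math
import random


def make_stimulus(condition, n=480, sr=48000, f=500.0, snr_db=-10.0, seed=0):
    """Return (left, right): identical noise at both ears plus a tone that is
    in phase (N0S0) or polarity-inverted at the right ear (N0Spi)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    amp = math.sqrt(2.0) * 10 ** (snr_db / 20)  # tone RMS relative to unit-variance noise
    tone = [amp * math.sin(2 * math.pi * f * i / sr) for i in range(n)]
    sign = -1.0 if condition == "N0Spi" else 1.0
    left = [nz + t for nz, t in zip(noise, tone)]
    right = [nz + sign * t for nz, t in zip(noise, tone)]
    return left, right


def cancel(left, right):
    """Interaural subtraction: removes anything identical at both ears."""
    return [l - r for l, r in zip(left, right)]


def bmld_db(threshold_n0s0_db, threshold_n0spi_db):
    """BMLD in dB: positive when the antiphasic signal is easier to detect."""
    return threshold_n0s0_db - threshold_n0spi_db


same = cancel(*make_stimulus("N0S0"))
anti = cancel(*make_stimulus("N0Spi"))
print(max(abs(v) for v in same))  # noise and tone both cancel
print(max(abs(v) for v in anti))  # noise cancels, tone survives at twice its amplitude
print(bmld_db(-10.0, -22.0))      # made-up thresholds, for illustration only
```

A monaural model sees only one channel and cannot perform this subtraction, which is consistent with the zero BMLD reported for the monaural negative controls.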
The experiments are comprehensive, involving nine different audio models and a variety of conditions to ensure thorough evaluation. The use of both synthetic targets and realistic speech excerpts enhances ecological validity. The results are presented clearly, with detailed comparisons against established baselines, and the statistical methods employed for significance testing are appropriate. The findings reveal critical insights into the limitations of current models in encoding phase information, which is essential for spatial audio perception.
The paper provides sufficient detail regarding the models, stimuli, and evaluation metrics, which supports reproducibility. However, the lack of publicly available code or datasets limits the ease with which others can replicate the experiments. Including a link to a GitHub repository or similar would enhance reproducibility significantly.
One limitation is the reliance on frozen models, which may not fully capture the dynamic nature of audio processing in real-world applications. Additionally, while the study identifies the shortcomings of general-purpose binaural models, it does not extensively explore potential solutions or improvements. The ecological confound in realistic speech conditions could also be further investigated to understand its implications better.
The findings have significant implications for the development of future spatial audio models, particularly in enhancing their ability to encode phase information, which is crucial for accurate sound localization. This research could influence the design of audio processing systems in various applications, including virtual reality, hearing aids, and immersive audio experiences.
We present target speaker tagging (TST), a task that integrates speaker diarization, verification, and identification into a unified workflow for multi-speaker conversations. Given long recordings and pre-enrolled speakers, TST detects and labels speech segments of known speakers while rejecting unknown ones. Despite its practical importance, research has been limited by the absence of suitable evaluation resources. To address this, we introduce TST-Bench, a large-scale synthetic benchmark with over 150 enrolled speakers, 300 sessions of 20-60 minutes, and reference annotations with global speaker labels. We define an evaluation protocol encompassing diarization and full-pipeline scenarios. Experiments on both real and synthetic data show that TST poses challenges not captured by conventional benchmarks, and that dedicated system design yields significant gains over naive integration of existing solutions. The benchmark dataset and evaluation protocols are publicly released.
Primary: NAVER Cloud Corporation
All Institutions: NAVER Cloud Corporation, NAVER Corporation
The paper presents a novel task and benchmark for target speaker tagging that integrates multiple aspects of speaker recognition, filling a critical gap in the field. The comprehensive methodology and robust experimental evaluation underscore its potential impact on real-world applications and future research directions.
The paper introduces the Target Speaker Tagging (TST) task, which is a novel integration of speaker diarization, verification, and identification. The methodology is well-structured, detailing the system's components and their interactions. The authors provide a clear definition of the TST task and articulate its significance in real-world applications. The construction of TST-Bench, a large-scale synthetic benchmark, is a significant methodological contribution, allowing for systematic evaluation of TST systems. The approach to combining speaker embeddings from multiple segments to enhance identification accuracy is particularly noteworthy.
The experimental setup is robust, utilizing both synthetic and real datasets to validate the proposed methods. The authors present thorough evaluations across different scenarios, demonstrating the effectiveness of their approach. Results indicate that the TST framework outperforms naive integrations of existing methods, highlighting the importance of dedicated system design. The use of metrics like Detection and Identification Rate (DIR) and False Alarm Rate (FAR) provides a comprehensive view of system performance.
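The two metrics can be sketched under the usual open-set identification reading: DIR counts enrolled-speaker segments that are both accepted and correctly labeled, and FAR counts unknown-speaker segments that are wrongly accepted. This is an assumption about the definitions; the benchmark may, for example, weight by duration rather than by segment. The trial format and values are made up.

```python
def dir_far(trials, threshold):
    """Compute (DIR, FAR) at a score threshold.

    Each trial is a dict with keys:
      true  - enrolled speaker id, or None for an unknown speaker
      pred  - top-1 enrolled speaker id proposed by the system
      score - similarity score for that proposal
    """
    known = [t for t in trials if t["true"] is not None]
    unknown = [t for t in trials if t["true"] is None]
    hits = sum(1 for t in known
               if t["score"] >= threshold and t["pred"] == t["true"])
    false_alarms = sum(1 for t in unknown if t["score"] >= threshold)
    dir_value = hits / len(known) if known else 0.0
    far_value = false_alarms / len(unknown) if unknown else 0.0
    return dir_value, far_value


# Made-up trials for illustration.
trials = [
    {"true": "A", "pred": "A", "score": 0.9},   # accepted and correct
    {"true": "B", "pred": "B", "score": 0.4},   # correct but rejected
    {"true": "C", "pred": "A", "score": 0.8},   # accepted but wrong speaker
    {"true": "A", "pred": "A", "score": 0.7},   # accepted and correct
    {"true": None, "pred": "B", "score": 0.6},  # unknown speaker accepted
    {"true": None, "pred": "C", "score": 0.2},  # unknown speaker rejected
]
print(dir_far(trials, threshold=0.5))
```

Raising the threshold lowers FAR at the cost of DIR, so the two numbers are best read together at a fixed operating point.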
The paper provides sufficient detail regarding the implementation of the TST system, including the specific models and techniques used for speaker diarization and identification. The authors state that the dataset and evaluation scripts are publicly released, which supports reproducibility of the evaluation. However, no public code repository or demo for the system itself is mentioned, which limits replication of the reported system results.
The paper acknowledges limitations related to the synthetic nature of TST-Bench, particularly the differences between synthetic and real conversational dynamics. The authors also note that while synthetic data allows for controlled experiments, it may not capture all the complexities of natural speech interactions. Additionally, the reliance on a specific type of speech (read speech) may not fully represent the variability found in spontaneous conversations.
The TST framework and benchmark have the potential to significantly advance the field of speaker recognition by providing a unified evaluation approach that reflects real-world challenges. This work could lead to improved systems for applications such as meeting transcription, voice-based services, and multi-session analytics. By addressing the limitations of existing benchmarks, the authors encourage further research and development in integrated speaker recognition systems.
Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to 11.9% relative improvement over the baseline.
Primary: affiliation=1 Mickael
All Institutions: affiliation=1 Mickael, affiliation=1 Driss, affiliation=2 Khaled
The main contribution of this paper is the introduction of a Mixture-of-Experts architecture for speech anti-spoofing, which enhances generalization capabilities compared to traditional methods. The technical contributions are significant, as they provide a new framework for improving the robustness of speech models against evolving spoofing techniques, which is crucial in an era of advanced synthetic speech technologies.
The paper presents a novel approach by converting a self-supervised speech model into a Mixture-of-Experts (MoE) architecture, which is a significant departure from existing methods that utilize low-rank adaptations. The methodology is well-structured, detailing the conversion process, gating mechanisms, and expert activation strategies. The authors also conduct a comprehensive architectural study, analyzing the effects of various design choices on performance, which adds depth to the methodology.
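A minimal sketch of the conversion idea, in pure Python for readability: one pretrained feed-forward block is replaced by several experts mixed by a softmax gate. It assumes every expert starts as a copy of the pretrained block, which is one common way to preserve pretrained behavior at initialization; the paper's initialization, activation function (ReLU here), expert count, and top-k routing are not specified in this review and are assumptions.

```python
import math
import random


def linear(weights, bias, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def ffn(params, x):
    """Two-layer feed-forward block with ReLU (stand-in for the encoder's FFN)."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(w1, b1, x)]
    return linear(w2, b2, hidden)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_ffn(experts, gate_w, gate_b, x, top_k=2):
    """Route x to the top_k experts and mix their outputs with renormalized gate weights."""
    probs = softmax(linear(gate_w, gate_b, x))
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = ffn(experts[i], x)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out


rng = random.Random(0)
dim, hidden_dim, n_experts = 4, 8, 4


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Stand-in for a pretrained FFN; real weights would come from the SSL checkpoint.
pretrained = (rand_matrix(hidden_dim, dim), [0.0] * hidden_dim,
              rand_matrix(dim, hidden_dim), [0.0] * dim)
# Every expert starts from the pretrained weights (a real conversion would deep-copy them).
experts = [pretrained for _ in range(n_experts)]
gate_w, gate_b = rand_matrix(n_experts, dim), [0.0] * n_experts

x = [0.5, -1.0, 0.25, 2.0]
dense_out = ffn(pretrained, x)
moe_out = moe_ffn(experts, gate_w, gate_b, x, top_k=2)
print(max(abs(a - b) for a, b in zip(dense_out, moe_out)))  # ~0 at initialization
```

Because the gate weights are renormalized to sum to one, identical experts reproduce the dense output exactly at initialization; the experts only diverge once fine-tuning updates them independently.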
The experimental setup is robust, utilizing 14 diverse spoofing datasets to evaluate the proposed approach. The reduction in macro EER from 5.46% to 4.81% demonstrates a meaningful improvement in performance. The paper includes detailed results across different configurations, providing a clear comparison with baseline methods and LoRA-based approaches, which strengthens the validity of the findings.
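The headline numbers can be checked with a few lines. The sketch below computes an EER from raw scores with a simple threshold sweep, a macro average as the unweighted mean over datasets, and the relative reduction between two error rates. The scores and per-dataset values are made up; only 5.46 and 4.81 come from the paper.

```python
def eer(bonafide_scores, spoof_scores):
    """Approximate equal error rate: sweep thresholds and take the point where
    false acceptance and false rejection are closest (higher score = more bona fide)."""
    best_gap, best_rate = None, None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


def macro_average(values):
    """Unweighted mean, so every dataset counts equally regardless of size."""
    return sum(values) / len(values)


def relative_reduction(baseline, proposed):
    return (baseline - proposed) / baseline


print(eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # made-up scores
print(macro_average([2.0, 4.0, 9.0]))                    # made-up per-dataset EERs
print(round(100 * relative_reduction(5.46, 4.81), 1))    # reported macro EERs
```

The last line reproduces the reported 11.9% relative improvement from the two macro EER values.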
The implementation details are sufficiently described, including the training protocols, datasets, and evaluation metrics. However, the paper lacks a direct link to the code or models, which could hinder reproducibility for other researchers. Providing a project URL would enhance this aspect significantly.
While the proposed MoE architecture shows improved performance, the analysis of expert specialization indicates that there is no clear routing specialization across different synthesizers. This could suggest limitations in the model's ability to adapt to various spoofing techniques. Additionally, the increased number of parameters in the MoE configuration may raise concerns about model efficiency.
The work addresses a critical challenge in the field of speech processing, particularly in the context of anti-spoofing, which has significant implications for security and trust in voice technologies. The findings could influence future research directions in robust speech recognition and synthesis, as well as applications in security systems.
Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.
Primary: Federal University of Goiás
All Institutions: Federal University of Goiás, Elsa Speak
The paper presents G-DFlow-TTS, an innovative alignment-free text-to-speech system that significantly improves intelligibility and robustness through a novel inference stack. The methodology is well-articulated, and the experimental results demonstrate the effectiveness of the proposed approach, marking a meaningful contribution to the field of machine learning in audio synthesis.
The proposed methodology introduces a novel inference-time control stack for non-autoregressive text-to-speech synthesis, leveraging Continuous-Time Markov Chains (CTMC) to enhance discrete flow matching. The integration of predictor-free guidance, conditional coupling, and a remasking mechanism (SC-ReMask) represents a significant advancement in the field, allowing for revisable token generation. The approach is well-structured and addresses the limitations of existing models by focusing on inference-time controls rather than solely increasing sampling steps.
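Two of the ingredients can be illustrated with a toy sketch: guidance as an extrapolation from unconditional toward conditional logits, and remasking as sending the least-confident generated tokens back to the mask state under a budget that shrinks over the sampling steps. This is a loose illustration only. The guidance form, the linear budget schedule, and the confidence-based selection are assumptions; the paper's CTMC rates, coupling, and SC-ReMask rule are not reproduced here.

```python
MASK = -1  # hypothetical id for the mask state


def guided_logits(cond, uncond, scale):
    """Extrapolate from unconditional toward conditional logits (scale=1 gives cond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def remask_budget(step, total_steps, max_remask):
    """Hypothetical schedule: allow fewer revisions as sampling progresses."""
    return int(max_remask * (1 - step / total_steps))


def remask(tokens, confidence, budget, frozen):
    """Send the least-confident generated tokens back to MASK.

    Positions in `frozen` (for example the acoustic prompt) and positions
    that are already masked are never touched.
    """
    candidates = [i for i, t in enumerate(tokens) if t != MASK and i not in frozen]
    candidates.sort(key=lambda i: confidence[i])
    revised = list(tokens)
    for i in candidates[:budget]:
        revised[i] = MASK
    return revised


tokens = [5, 7, MASK, 3]            # position 0 is a prompt token
confidence = [0.9, 0.2, 0.0, 0.6]   # made-up per-position confidences
print(guided_logits([2.0, 0.0], [1.0, 1.0], 2.0))
print([remask_budget(s, 4, 2) for s in range(5)])
print(remask(tokens, confidence, budget=1, frozen={0}))
```

The point of the token-to-mask transition is visible in the last line: an early low-confidence decision at position 1 is undone and can be resampled later, instead of being locked in for the rest of a low-step run.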
The experiments are robust, utilizing both objective metrics (WER, CER) and subjective evaluations (MOS) to assess the performance of the proposed G-DFlow-TTS system against baselines. The controlled ablations provide clear insights into the contributions of each component of the proposed stack, demonstrating significant improvements in intelligibility and robustness. The use of a well-defined dataset (LibriSpeech) further strengthens the evaluation.
The paper provides detailed implementation specifics, including model architecture, training parameters, and evaluation protocols, which enhances reproducibility. However, the lack of a public code repository limits the ability for independent verification of results.
One notable limitation is the reliance on a single dataset for evaluation, which may not fully capture the generalizability of the model across diverse speech patterns and languages. Additionally, the paper acknowledges that speaker similarity remains below that of larger external systems, indicating potential areas for improvement.
The advancements in alignment-free text-to-speech synthesis have significant implications for applications in voice assistants, audiobooks, and accessibility technologies. The ability to revise token decisions during generation could lead to more natural-sounding speech synthesis, enhancing user experience in various audio applications.
We present MaskedFOP, a system for closed-set polyglot speaker identification under two simultaneous challenges: the face modality is entirely absent at test time, and speech comes from Urdu, a language unseen during face-supervised training. The system integrates three complementary mechanisms. First, a modality-dropout dual-head network built on the Fusion and Orthogonal Projection (FOP) backbone forces the audio branch to develop independent discriminative power via per-sample face masking, ensuring that the audio encoder remains capable when face is absent. Second, two MaskedFOP instances trained on Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) features with different random seeds produce complementary audio embeddings whose element-wise average yields a more robust 512-dimensional representation than any single model. Third, a two-stage cascaded inference procedure first refines multimodal labels through a fused Graph Label Propagation (GLP) pass (Stage 1), then assigns audio-only labels by cosine nearest-centroid (Stage 2), replacing the 70 sparse training prototypes with ~1,500 in-domain test-set centroids from Stage 1. Submitted to the POLY-SIM 2026 Grand Challenge, the system achieves a mean P-accuracy of 0.9989, placing first among all submissions evaluated on the challenge server. An ablation identifies cascaded seeding as the single largest gain (>8 pp on P4/P6). The code is available at https://github.com/Ayoub-Elkhouzari/POLY-SIM2026.
Primary: University Mohammed VI Polytechnic
All Institutions: University Mohammed VI Polytechnic
The paper presents MaskedFOP, a novel system for polyglot speaker identification that excels in scenarios with missing visual modalities, achieving state-of-the-art performance in a challenging evaluation setting. The integration of advanced techniques such as modality dropout, multi-seed averaging, and cascaded label propagation showcases a significant advancement in the field of speaker recognition and multimodal learning.
The methodology presented in MaskedFOP is innovative, integrating a dual-head modality-dropout network with a cascaded graph label propagation approach. The use of per-sample face masking during training effectively enhances the audio branch's robustness when the visual modality is absent. The multi-seed averaging technique for audio embeddings further improves the stability of the representations. The two-stage inference process, which refines multimodal labels before assigning audio-only labels, is a significant advancement in handling missing modalities in speaker identification.
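The multi-seed averaging and Stage 2 cosine nearest-centroid assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy 3-dimensional vectors (the paper uses 512 dimensions), and the assumption that Stage 1 labels are already available are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def average_embeddings(emb_a, emb_b):
    """Element-wise mean of two embeddings of the same utterance (one per seed)."""
    return [(a + b) / 2.0 for a, b in zip(emb_a, emb_b)]

def build_centroids(embeddings, labels):
    """Mean embedding per label, L2-normalized. Labels are assumed to come from Stage 1."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(emb)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], emb)]
        counts[lab] += 1
    return {lab: l2_normalize([s / counts[lab] for s in sums[lab]]) for lab in sums}

def assign_nearest_centroid(embedding, centroids):
    """Return the label whose centroid has the highest cosine similarity to the query."""
    query = l2_normalize(embedding)
    def cosine(centroid):
        return sum(q * c for q, c in zip(query, centroid))
    return max(centroids, key=lambda lab: cosine(centroids[lab]))

# Toy data: three utterances embedded by two hypothetical seed models.
seed_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
seed_b = [[0.8, 0.2, 0.0], [1.0, 0.0, 0.1], [0.1, 0.9, 0.0]]
stage1_labels = ["spk1", "spk1", "spk2"]  # assumed output of the Stage 1 pass

fused = [average_embeddings(a, b) for a, b in zip(seed_a, seed_b)]
centroids = build_centroids(fused, stage1_labels)
print(assign_nearest_centroid([0.8, 0.05, 0.0], centroids))
```

Normalizing both the centroids and the query means the dot product equals cosine similarity, so the assignment depends only on direction, not on embedding magnitude.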
The experimental evaluation is thorough, leveraging the POLY-SIM 2026 Grand Challenge dataset, which provides a robust benchmark for assessing the model's performance under challenging conditions. The reported mean P-accuracy of 0.9989 is impressive, particularly given the complexities of cross-lingual speaker identification and missing visual modalities. The ablation studies effectively demonstrate the contributions of each component, highlighting the importance of the cascaded seeding strategy.
The paper provides sufficient implementation details, including hyperparameters and training procedures, which are crucial for reproducibility. The availability of the code on GitHub further enhances the potential for other researchers to replicate the results. However, the reliance on fixed pre-extracted features might limit the adaptability of the approach to other datasets or modalities.
The primary limitations include the closed-set assumption, which may not generalize to open-set scenarios, and the dependence on English-trained features, which could affect performance on other languages or dialects. Additionally, the transductive nature of the inference process requires the entire unlabeled test partition at once, which may not be feasible in all applications.
The proposed system has significant implications for biometric recognition systems, particularly in multilingual and multimodal contexts. It could enhance applications in security, user authentication, and personalized services where speaker identification is critical. The methodology could also inspire future research in cross-modal learning and robust speaker recognition under varying conditions.
We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
Primary: Claude Code / Opus 4.6
All Institutions: Claude Code / Opus 4.6, API / Fable 5, Independent researcher
The main contribution of this paper is the establishment of a formal structural isomorphism between Beethoven's "Moonlight Sonata" and machine learning architectures, revealing deep connections between music and computational mechanisms. This work significantly advances the understanding of both domains and proposes a novel methodology that integrates human perception with computational analysis, offering new insights into the nature of music and its relationship with machine learning.
The paper employs a novel approach by establishing a structural isomorphism between musical compositions and machine learning architectures, utilizing various computational analyses such as Shannon entropy, Jensen-Shannon divergence, and self-similarity matrices. The methodology is rigorous, employing both quantitative and qualitative analyses, and introduces a reverse sonification process that allows for the exploration of chirality in the encode-decode cycle. This feedback loop between human perception and computational analysis is a significant methodological innovation.
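Two of the distributional measures named above, Shannon entropy and Jensen-Shannon divergence, can be computed over pitch-class histograms as in the sketch below. The pitch-class sequences are made-up toy inputs, not data from the paper, and the helper names are hypothetical.

```python
import math
from collections import Counter

def distribution(symbols):
    """Empirical probability of each symbol in a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def shannon_entropy(p):
    """Entropy in bits of a distribution given as {symbol: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 for identical, 1 for disjoint distributions)."""
    support = set(p) | set(q)
    mix = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in support}
    def kl_to_mix(a):
        return sum(a[s] * math.log2(a[s] / mix[s]) for s in a if a[s] > 0)
    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)

# Toy pitch-class sequences (0 = C, 1 = C#, ...), purely illustrative.
movement_a = distribution([1, 4, 8, 1, 4, 8, 1, 4])
movement_b = distribution([1, 1, 3, 5, 6, 8, 10, 0])
print(shannon_entropy(movement_a), shannon_entropy(movement_b))
print(js_divergence(movement_a, movement_b))
```

Because both measures depend only on symbol frequencies, shuffling a sequence leaves them unchanged, which is why the paper's chirality analysis needs order-sensitive n-gram statistics in addition.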
The experiments conducted are comprehensive, analyzing Beethoven's "Moonlight Sonata" across its three movements. The authors provide a detailed breakdown of the metrics used, including entropy, dissonance, and memory decay, and validate their findings through bootstrap baselines and subsampling checks. The results are presented clearly, demonstrating the structural correspondences between music and ML mechanisms, with counterintuitive findings that challenge existing assumptions about music theory and machine learning.
The paper includes a repository with all code, data, figures, and generated MIDI files, which enhances reproducibility. However, the analysis operates at a symbolic level rather than a signal level, which may limit the ability to fully reproduce the auditory experience of the original music. The authors transparently report their methods and findings, including limitations related to sample size and the metrics used.
The analysis is limited to symbolic representations of music, neglecting aspects like timbre and dynamics that are crucial for a complete understanding of musical perception. Additionally, the reverse sonification process simplifies rhythmic structures, which could lead to a loss of important musical information. The chirality measurement is also bounded by n-gram order, potentially overlooking higher-order dependencies in musical structure.
This research has the potential to influence multiple fields, including computational musicology, machine learning, and cognitive neuroscience. By establishing a formal correspondence between music and ML mechanisms, it opens avenues for interdisciplinary research and applications, such as improved music generation models and enhanced understanding of musical cognition. The findings could also inspire new methodologies in analyzing other forms of art and complex systems.
This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications. We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By evaluating the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label. Full code available at: https://github.com/cncPomper/Audio-XAI
Primary: Warsaw University of Technology
All Institutions: Warsaw University of Technology
This paper provides a crucial investigation into the vulnerabilities of audio deepfake detection systems, demonstrating that attribution maps can be manipulated while preserving predictions and audio quality. The innovative psychoacoustic approach and thorough experimental evaluation contribute significantly to the understanding of explainability in audio models, marking a step forward in the field.
The paper introduces a novel psychoacoustic framework for manipulating audio model attributions while preserving predictions. This approach is innovative as it adapts adversarial attacks from the image domain to audio, incorporating perceptual metrics that are more relevant to human auditory perception. The methodology is well-structured, utilizing a combination of established XAI techniques (Grad-CAM and LRP) and new psychoacoustic constraints, which is a significant advancement in the field of audio explainability.
The experiments are comprehensive, utilizing a diverse set of architectures and a well-defined dataset (SONICS). The evaluation of the manipulation cost through perceptual audio quality metrics is particularly noteworthy, as it aligns the technical assessment with human auditory experience. The results clearly demonstrate the effectiveness of the proposed method in manipulating attribution maps while maintaining high audio fidelity, which is a crucial aspect for practical applications in audio deepfake detection.
The authors provide a GitHub repository with full code and configurations, which enhances reproducibility. However, the paper could benefit from more detailed documentation on the experimental setup and hyperparameter choices to facilitate easier replication of results by other researchers.
One limitation is the focus on specific architectures and datasets, which may not generalize across all audio models or applications. Additionally, while the psychoacoustic framework is innovative, the paper does not extensively discuss potential countermeasures against such attacks, which could be critical for real-world applications.
The findings have significant implications for the field of explainable AI, particularly in audio applications. By highlighting the vulnerabilities in current explanation methods, this research can inform the development of more robust and trustworthy audio classification systems. The work also raises ethical considerations regarding the potential misuse of adversarial techniques in manipulating model interpretations.
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.
Primary: Amazon AGI
All Institutions: Amazon AGI, IIT Kharagpur
The main contribution of this paper is the introduction of ModeratorLM, a role-playing voice agent that enhances turn-taking in multi-party conversations through role conditioning and reasoning. This work represents a significant advancement in the field of conversational AI, addressing a critical challenge in multi-party interactions and providing a novel dataset for future research.
The proposed methodology introduces ModeratorLM, a role-playing voice agent that utilizes a speech large language model (LLM) to manage turn-taking in multi-party conversations. The approach is innovative in its use of role conditioning to influence turn-taking behavior, which is a significant advancement over traditional models that do not consider role dynamics. The integration of chain-of-thought reasoning in the ModeratorLM-Think variant adds an additional layer of sophistication, allowing the model to better interpret conversational context. The construction of the RolePlayConv dataset is also a notable contribution, as it provides a tailored resource for training and evaluating role-conditioned agents in multi-party settings. However, the reliance on synthetic data may raise questions about the generalizability of the findings.
The experiments conducted demonstrate a clear improvement in turn-taking precision and recall when using the ModeratorLM models compared to non-role-conditioned baselines. The use of both real-world meeting data and the synthetic RolePlayConv dataset strengthens the evaluation. The metrics reported, including precision, recall, F1-score, and reactive miss rate, provide a comprehensive view of the model's performance. The ablation studies further validate the importance of dynamic chunking and the role of reasoning in enhancing model performance. However, the lack of extensive human evaluations beyond the small-scale study may limit the robustness of the claims regarding role fidelity.
The paper provides a detailed description of the training and evaluation setup, including the architecture of the models, the dataset construction process, and the evaluation metrics. However, there is no mention of code or data availability, which is crucial for reproducibility in machine learning research. The absence of a demo or project URL also makes it harder for others to replicate the work.
One significant limitation is the reliance on the synthetic RolePlayConv dataset for training, which may not fully capture the complexities of real-world multi-party conversations. Additionally, while the model shows improved performance in turn-taking, it remains conservative, missing some valid response opportunities, which could affect user experience in practical applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The development of role-conditioned voice agents has the potential to significantly enhance the usability of conversational AI in various applications, such as virtual assistants, customer service, and collaborative tools. By improving turn-taking behavior, these agents can facilitate more natural and effective interactions in multi-party settings. However, ethical considerations regarding the deployment of such technology, especially in sensitive contexts, must be carefully evaluated.
Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
Primary: Huawei Technologies, China
All Institutions: Huawei Technologies, China
The main contribution of this paper is the development of an end-to-end model for multi-talker ASR that balances ASR and diarization tasks through innovative architecture and loss function design. This work represents a meaningful advancement in the field of speech recognition, particularly in handling overlapping speech, and demonstrates the potential of LLMs in improving speaker attribution accuracy.
The paper introduces a dual-encoder architecture that effectively extracts semantic and speaker features, employing innovative strategies such as feature interleaving and a length-aware speaker ID loss. The adaptive threshold strategy for ASR loss computation is particularly noteworthy, as it addresses the common issue of hallucinations in overlapping speech. The methodology is well-structured and demonstrates a clear understanding of the challenges in multi-talker ASR and diarization.
The experiments are comprehensive, utilizing two significant corpora (AliMeeting and Aishell4) to validate the proposed methods. The reported improvements over baseline systems are substantial, with relative gains of 18% and 24% in performance metrics. The evaluation metrics, including Character Error Rate (CER) and concatenated minimum-permutation character error rate (cpCER), are appropriate for assessing the effectiveness of the system.
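For readers unfamiliar with cpCER, the sketch below shows the usual idea behind the metric: concatenate each speaker's utterances, try every mapping between hypothesis and reference speakers, and report the lowest character error rate. It is a brute-force illustration under that common definition, not the scoring script used in the paper, and the function names are hypothetical.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cp_cer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation CER over per-speaker utterance lists."""
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]
    # Pad with empty speakers so every reference speaker can be paired.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference transcript is empty")
    # Exhaustive search is factorial in the speaker count; fine for a handful of speakers.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref

# The hypothesis lists the speakers in swapped order and has one wrong character.
reference = [["ab", "cd"], ["efg"]]
hypothesis = [["efg"], ["ab", "cx"]]
print(cp_cer(reference, hypothesis))
```

The permutation search makes the score independent of how the system numbers its speakers, so only genuine attribution and recognition errors are counted.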
The paper provides sufficient details about the model architecture, training process, and evaluation metrics, which facilitates reproducibility. However, the absence of publicly available code or datasets limits the ability of other researchers to replicate the findings fully.
One limitation is the reliance on limited real-recorded data, which may affect the generalizability of the model. Additionally, while the adaptive loss masking strategy shows promise, its effectiveness in more diverse or challenging datasets remains to be validated.
The proposed system has significant implications for real-world applications in multi-talker environments, such as meetings and conferences, where accurate speaker attribution is crucial. The integration of ASR and diarization in a unified model could enhance various applications, including automated transcription services and interactive voice response systems.
Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Agency for Science, Technology and Research, National University of Singapore, Shenzhen Loop Area Institute, Shenzhen Research Institute of Big Data
The main contribution of this paper is the introduction of Emo-LiPO, a listwise preference optimization framework that significantly enhances fine-grained emotion intensity control in LLM-based TTS systems. This work addresses critical challenges in the field, providing a novel methodological approach and a valuable dataset for future research.
The paper proposes Emo-LiPO, a novel listwise preference optimization framework that reformulates emotion intensity control in LLM-based TTS as a learning-to-rank problem. This approach is innovative as it explicitly models global intensity ordering, addressing the semantic-acoustic gap in existing methods. The methodology is well-structured, including a comprehensive description of the problem formulation, the construction of the ESD-plus dataset, and the multi-stage optimization process. The use of a rule-based preference construction strategy for generating training data is a significant strength, as it allows for a more controlled and systematic evaluation of emotion intensity.
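To make the learning-to-rank framing concrete, the sketch below implements a generic listwise ranking loss: the negative log-likelihood of a target ordering under the Plackett-Luce model (often called ListMLE). This is a standard loss chosen for illustration and is not claimed to be Emo-LiPO's exact objective. The scores are hypothetical per-sample model scores for one transcript, ordered from the sample that should express the strongest intensity to the weakest.

```python
import math

def listmle_loss(scores_in_target_order):
    """Plackett-Luce negative log-likelihood of the target ordering.

    scores_in_target_order[0] is the score of the item that should rank first.
    """
    loss = 0.0
    for i in range(len(scores_in_target_order)):
        tail = scores_in_target_order[i:]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(tail)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in tail))
        loss += log_z - scores_in_target_order[i]
    return loss

# Hypothetical scores for samples that should rank high > medium > low intensity.
well_ordered = listmle_loss([3.0, 2.0, 1.0])
mis_ordered = listmle_loss([1.0, 2.0, 3.0])
print(well_ordered, mis_ordered)
```

Unlike a pairwise loss, each term compares one item against all items still to be ranked, so a single update reflects the full ordering of the list.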
The experiments are robust, utilizing both automatic and human evaluations to assess the performance of Emo-LiPO against multiple baselines. The results demonstrate significant improvements in emotion accuracy and intensity controllability, particularly at higher intensity levels. The inclusion of various metrics for evaluation, such as WER, NISQA, and human preference comparisons, adds depth to the experimental assessment. The dataset ESD-plus is well-constructed, providing a solid foundation for evaluating the proposed method.
The paper provides a link to the GitHub repository containing the code, which is a positive aspect for reproducibility. However, detailed implementation specifics, such as hyperparameters and training configurations, are not fully disclosed in the text, which could pose challenges for other researchers attempting to replicate the results.
One limitation is the reliance on a single dataset (ESD-plus) for evaluation, which may affect the generalizability of the findings. Additionally, while the method shows improvements in emotion intensity control, the paper does not extensively discuss potential biases in the dataset or the implications of the rule-based preference construction strategy.
The Emo-LiPO framework has significant implications for the development of more expressive and controllable TTS systems, enhancing applications in areas such as virtual assistants, audiobooks, and entertainment. By improving fine-grained emotion intensity control, this research could lead to more engaging and human-like interactions in various audio applications.
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Carnegie Mellon University
The paper presents a significant advancement in low-latency spoken dialogue systems through the introduction of endpoint anticipation, which allows for proactive processing of user speech. This innovative approach, combined with a robust evaluation framework, positions the work as a valuable contribution to the field of audio and machine learning.
The paper introduces a novel approach to endpoint anticipation in spoken dialogue systems, shifting from reactive to proactive detection of end-of-turn signals. The dual-stream audio representation and the use of independent binary classification tasks for different anticipation horizons are well-structured and innovative. The proposed metrics for evaluating the trade-off between latency reduction and computational redundancy are a significant contribution to the field, allowing for a more nuanced understanding of system performance. The integration with the Unmute framework demonstrates practical applicability, although the paper could benefit from clearer explanations of the model architecture and training procedures.
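One plausible way to set up independent binary targets per anticipation horizon is sketched below: for each audio frame, the target for horizon h is 1 when the turn ends within the next h seconds. This labeling rule, the frame times, and the two shorter horizons are assumptions made for illustration; only the 2.56-second maximum comes from the abstract, and the paper's actual target definition may differ.

```python
def horizon_labels(frame_times, turn_end, horizons):
    """Build one binary target per horizon for every frame.

    A target is 1 when the turn ends no more than `h` seconds after the frame
    (and has not already ended), otherwise 0.
    """
    labels = []
    for t in frame_times:
        remaining = turn_end - t
        labels.append([1 if 0.0 <= remaining <= h else 0 for h in horizons])
    return labels

# Hypothetical setup: frames at 1-second steps, turn ending at t = 3.0 s.
horizons = [0.64, 1.28, 2.56]
for t, row in zip([0.0, 1.0, 2.0, 3.0],
                  horizon_labels([0.0, 1.0, 2.0, 3.0], 3.0, horizons)):
    print(t, row)
```

Treating each horizon as its own classifier lets a downstream system trade latency for risk: triggering on the longest horizon starts speculative work earliest but wastes more computation when the prediction is wrong.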
The evaluation is thorough, utilizing two diverse datasets (SpokenWOZ and Switchboard) to assess the model's performance across various anticipation horizons. The results show a consistent improvement over the VAP baseline, with a notable average latency reduction of 505 ms. The introduction of specific metrics like Median Realized Anticipation and Expected Redundant Computation provides valuable insights into the model's efficiency and effectiveness. However, the paper could enhance its experimental rigor by including more comprehensive ablation studies to analyze the impact of different components of the model.
The authors mention that they will open-source their implementation, which is a positive step towards reproducibility. However, the paper lacks detailed information on hyperparameter tuning, model training specifics, and the exact configurations used in experiments, which could hinder replication efforts by other researchers.
One limitation is the reliance on specific datasets, which may not capture the full variability of real-world conversational speech. Additionally, while the model shows promise in structured dialogues, its performance in more spontaneous, open-domain conversations remains uncertain. The trade-off between latency reduction and computational redundancy, while quantified, may still lead to inefficiencies in certain scenarios, especially in longer dialogues.
The proposed framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interactions, such as virtual assistants and customer service bots. By enabling speculative execution of downstream processes, the model could enhance user experience in conversational AI, making interactions feel more natural and responsive. The open-source nature of the project may also foster further research and development in this area.
Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework eliminating real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on-the-fly from basic acoustic primitives and composition rules. The encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making representations linearly decodable. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is available at: https://github.com/Freyliu0516/audioPG.
Primary: East China Normal University
All Institutions: East China Normal University, Fudan University, Shanghai Jiao Tong University, Southeast University
The paper presents AudioPG, a procedural synthesis framework for audio representation learning that eliminates the need for real recordings. This innovative approach not only enhances efficiency and interpretability in audio learning but also opens new avenues for research in self-supervised learning and audio synthesis.
The methodology presented in the paper is innovative, utilizing a procedural audio synthesis framework (AudioPG) that generates audio waveforms on-the-fly without relying on real-world audio recordings. This approach leverages basic acoustic primitives and composition rules, allowing for a systematic exploration of audio representation learning. The use of a Transformer-based masked autoencoder to reconstruct log-Mel spectrograms is a well-established technique, but the novelty lies in the complete detachment from real data during pre-training, which is a significant advancement in the field. The detailed description of the procedural synthesizer and its components showcases a robust understanding of sound synthesis principles, enhancing the interpretability of the learned representations.
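As a rough picture of what on-the-fly generation from acoustic primitives can look like, the sketch below synthesizes decaying harmonic tones and mixes a random number of them into one clip. The specific primitive, parameter ranges, and function names are assumptions for illustration and do not reproduce AudioPG's synthesizer; fundamental frequency and per-source gain are exposed because the abstract names them as factors recoverable from the latent space.

```python
import math
import random

def harmonic_tone(f0, duration, sample_rate=16000, n_harmonics=4, decay=3.0):
    """Sum of harmonics of f0 with 1/k amplitudes and an exponential decay envelope."""
    n_samples = int(duration * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = sum(math.sin(2.0 * math.pi * f0 * k * t) / k
                    for k in range(1, n_harmonics + 1))
        samples.append(value * math.exp(-decay * t))
    return samples

def mix(sources, gains):
    """Weighted sum of sources, peak-normalized to the range [-1, 1]."""
    length = max(len(s) for s in sources)
    out = [0.0] * length
    for source, gain in zip(sources, gains):
        for i, x in enumerate(source):
            out[i] += gain * x
    peak = max(abs(x) for x in out) or 1.0
    return [x / peak for x in out]

def random_clip(rng, duration=0.5, sample_rate=16000):
    """Draw 1 to 3 tones with random f0 and gain, then mix them into one waveform."""
    n_sources = rng.randint(1, 3)
    # f0 <= 1000 Hz with 4 harmonics stays below the 8 kHz Nyquist limit.
    sources = [harmonic_tone(rng.uniform(100.0, 1000.0), duration, sample_rate)
               for _ in range(n_sources)]
    gains = [rng.uniform(0.2, 1.0) for _ in range(n_sources)]
    return mix(sources, gains)

clip = random_clip(random.Random(0))
print(len(clip))
```

Because every clip is drawn from a seeded generator, the training stream is unlimited yet reproducible, and the generating parameters are known for each sample, which is what makes probing the learned representation for physical factors straightforward.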
The experimental evaluation is thorough, with performance metrics reported on multiple real-world benchmarks (ESC-50, UrbanSound8K, FSD50K, and Speech Commands V2). The results demonstrate that the AudioPG framework achieves competitive accuracy levels, indicating effective transfer learning capabilities from synthetic to real audio tasks. The paper also includes an ablation study that quantifies the contributions of various synthesizer components, providing insights into the model's performance dynamics. However, the reliance on a single GPU for pre-training may limit the generalizability of the findings to larger-scale applications.
The paper includes a link to the code repository, which is essential for reproducibility. However, the details regarding the specific configurations and hyperparameters used during training and evaluation could be more explicit to facilitate easier replication of the results by other researchers. The description of the datasets and evaluation protocols is adequate, but clearer guidelines on the setup would enhance reproducibility.
One limitation identified in the study is the semantic gap between the physical attributes captured by the procedural generator and the high-level semantic categories required for accurate classification in real-world tasks. The model struggles with fine-grained distinctions in audio classification, particularly in cases where acoustic similarities lead to misclassifications. Additionally, the lack of high-level semantic modeling in the procedural synthesis may restrict its applicability in more complex audio understanding tasks.
The potential applications of this research are significant, particularly in scenarios where large-scale audio datasets are unavailable due to privacy or resource constraints. The procedural generation approach could democratize access to audio representation learning, enabling researchers and practitioners to develop models without the burden of extensive data curation. Furthermore, the insights gained from the latent space analysis may inform future work in audio synthesis and representation learning, bridging the gap between physical sound properties and semantic understanding.
Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper presents a novel approach to generating training targets for speech enhancement in real-world scenarios, significantly improving upon existing methods. The technical contributions, particularly in the formulation of the C2D projection method and its empirical validation, highlight its potential impact on the field of audio processing and machine learning.
The proposed Close-to-Distant microphone Projection (C2D projection) method is a significant advancement in generating training targets for speech enhancement in real-world scenarios. The methodology effectively addresses the challenge of obtaining paired clean and distorted speech signals by leveraging recordings from close and distant microphones. The use of a projection matrix derived from a variant of the Parametric Multichannel Wiener Filter (PMWF) is innovative, as it allows for simultaneous denoising and alignment of signals, which is crucial for training neural networks. The paper provides a clear mathematical formulation and rationale for the method, making it accessible for replication and further research.
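For readers unfamiliar with the PMWF, the sketch below computes the textbook filter for a single frequency bin: the weights are the reference-channel column of Φ_n⁻¹Φ_s divided by β plus the trace of Φ_n⁻¹Φ_s. This is the standard form only, not the variant used for C2D projection, whose details are not given here. The names `phi_s`, `phi_n`, `beta`, and `ref`, and the toy covariance matrices, are assumptions for illustration.

```python
import numpy as np

def pmwf_weights(phi_s, phi_n, beta=1.0, ref=0):
    """Textbook PMWF weights for one frequency bin.

    phi_s: (M, M) speech spatial covariance, phi_n: (M, M) noise covariance.
    beta trades noise reduction against speech distortion (beta=0 gives MVDR
    when phi_s is rank one).
    """
    ratio = np.linalg.solve(phi_n, phi_s)   # equals inv(phi_n) @ phi_s
    lam = np.trace(ratio).real              # scalar normalizer
    return ratio[:, ref] / (beta + lam)     # column `ref` selects the reference mic

# Toy example: rank-one speech covariance built from a steering vector `a`.
a = np.array([1.0, 0.5 + 0.5j, -0.3j])
phi_s = 2.0 * np.outer(a, a.conj())
phi_n = np.diag([0.1, 0.2, 0.3]).astype(complex)

w = pmwf_weights(phi_s, phi_n, beta=0.0)
response = np.vdot(w, a)  # w^H a: filter response to the speech direction
```

With `beta=0.0` and a rank-one speech covariance, the response to the speech direction equals the reference-channel entry of `a`, which is the distortionless property of MVDR. Larger `beta` values suppress more noise at the cost of some speech distortion.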
The experimental evaluation is robust, utilizing the CHiME6 and CHiME8 datasets, which are well-regarded benchmarks in the field of speech enhancement and automatic speech recognition. The results demonstrate that the C2D projection method outperforms the state-of-the-art Guided Source Separation (GSS) approach under both matched and mismatched conditions. The use of objective metrics such as tcpWER and DNSMOS adds credibility to the findings. However, the paper could benefit from additional qualitative assessments or user studies to further validate the improvements in speech intelligibility and quality.
The paper includes sufficient detail regarding the implementation of the C2D projection method and the training of the neural network, referencing publicly available code for the model. However, the reproducibility could be enhanced by providing more explicit details on the training process, hyperparameters, and the specific configurations used for the experiments, as well as making the generated datasets available.
One limitation noted is the reliance on oracle diarization labels during training, which may not be feasible in practical applications. Additionally, while the method shows robustness against some mismatches in training and test conditions, there are scenarios where performance degradation is observed, indicating that further work is needed to enhance the method's adaptability to diverse environments.
The C2D projection method has significant implications for real-world applications in speech enhancement, particularly in environments where distant microphones are used, such as in meetings or public speaking events. The ability to generate high-quality training targets from real recordings can lead to improved performance in automatic speech recognition systems, potentially enhancing user experiences in various audio-based applications. The findings could also inspire further research into novel training techniques for other audio processing tasks.
Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. We propose self-guidance, a training mechanism whose core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.
Primary: Tsinghua University
All Institutions: Tsinghua University, Pengcheng Laboratory
The paper presents self-guidance, a novel training mechanism that enhances the fidelity of neural speech codecs by aligning decoder outputs for quantized and continuous latent representations. This contribution is significant as it addresses a critical bottleneck in audio coding, providing a practical solution that improves reconstruction quality while simplifying downstream language modeling tasks.
The proposed self-guidance mechanism introduces a novel approach to enhance the robustness of VQ-VAE-based neural speech codecs against quantization artifacts. By aligning the decoder's internal feature manifolds through a lightweight feature-mapping loss, the methodology effectively mitigates the impact of quantization error without requiring significant changes to the model architecture or inference process. This innovative approach is well-justified and supported by thorough theoretical grounding and empirical validation.
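The alignment idea can be illustrated with a small sketch: run the same decoder on the quantized latents and on the original continuous latents, then penalize the distance between their intermediate features. Everything below is an assumption for illustration, including the toy `tanh` decoder, the per-layer mean squared error, and the names `quantize`, `decoder_features`, and `feature_mapping_loss`. The paper's actual loss, choice of layers, and gradient handling are not specified here.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization: map each row of z to a codeword."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[dists.argmin(axis=1)]

def decoder_features(x, weights):
    """Toy decoder: return the activations after every layer."""
    feats, h = [], x
    for w in weights:
        h = np.tanh(h @ w)
        feats.append(h)
    return feats

def feature_mapping_loss(z, codebook, weights):
    """Mean per-layer MSE between decoder features of quantized and continuous inputs."""
    f_quant = decoder_features(quantize(z, codebook), weights)
    f_cont = decoder_features(z, weights)
    return float(np.mean([np.mean((q - c) ** 2) for q, c in zip(f_quant, f_cont)]))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4))            # continuous encoder outputs
codebook = rng.standard_normal((16, 4))    # hypothetical 16-entry codebook
weights = [rng.standard_normal((4, 4)) for _ in range(3)]
loss = feature_mapping_loss(z, codebook, weights)
```

The loss is zero only when quantization leaves the latents unchanged, so minimizing it pushes the decoder to treat quantized inputs as it would their continuous counterparts. Since the term exists only at training time, the inference path is untouched, consistent with the claim of no inference-time changes.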
The experiments are comprehensive, utilizing the LibriSpeech dataset to evaluate reconstruction performance across various codebook sizes and quantization methods. The results demonstrate significant improvements in reconstruction metrics, establishing state-of-the-art performance for low-bitrate speech codecs. The inclusion of subjective evaluations further strengthens the findings, providing a well-rounded assessment of the proposed method's efficacy.
The paper provides sufficient implementation details, including model configurations and training procedures, which facilitate reproducibility. The use of open-source code for the baseline models enhances transparency and allows for independent verification of results.
While the self-guidance mechanism shows promise, the paper acknowledges that it does not completely eliminate quantization artifacts, indicating that some residual distortion may persist. Additionally, the validation is primarily focused on neural speech codecs, and the applicability to other audio domains remains to be explored.
The proposed method has significant implications for improving audio compression technologies, potentially enhancing accessibility in telecommunications and low-bandwidth applications. However, the potential for misuse in generating deceptive audio content must be considered, emphasizing the need for responsible deployment of such technologies.
Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion so that, throughout the performance, it gains increasing agency, transitioning from reactive to more proactive roles. The work employs four independent loopers in Max/MSP to function as 'virtual' drones. They are populated cyclically in real time as the vocalist improvises, creating an organic and evolving feedback loop between the voice and the virtual drone. This relationship further evolves melodically through pitch shifting of the loops, which introduces a dimension of sudden, explicit movement. It then changes timbrally, via the integration of GaMaDHaNi, a singer-conditioned pitch-to-voice generative AI model, to resynthesize looped audio. While current music AI approaches prioritize high fidelity and realism of generated content, which has sparked anxiety over job replacement in the music community, this work intentionally utilizes low-fidelity generative outputs, further necessitating human interpretation and situational context in order to be complete. 'The Moving Drone' positions technology and generative AI within established socio-cultural musical practices, proposing a virtual drone as an active, responsive, and co-creative musical agent.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, Harvard University
The main contribution of this paper is the innovative exploration of agency in music through the integration of generative AI and traditional Hindustani music practices. This work significantly advances the discourse on the role of technology in artistic expression, proposing a model where AI acts as a collaborative partner rather than a mere tool, thus enriching the creative landscape in music.
The methodology employed in "The Moving Drone" is innovative in its integration of traditional Hindustani music with modern technology, specifically through the use of Max/MSP and generative AI. The paper outlines a clear structure for the performance, detailing how the drone's agency is manipulated through three distinct movements. Each movement explores different aspects of musical improvisation and interaction with technology, showcasing a thoughtful approach to blending human creativity with AI-generated outputs. The use of pitch shifting and the GaMaDHaNi model to resynthesize audio adds depth to the methodology, allowing for a dynamic interaction between the vocalist and the virtual drone.
The paper does not provide traditional experimental results or quantitative metrics commonly found in machine learning research. Instead, it focuses on a performance-based evaluation, which is appropriate given the artistic nature of the work. The description of the three movements serves as a qualitative assessment of the system's capabilities. However, the lack of formalized evaluation metrics (e.g., listener studies or comparative analysis) limits the ability to rigorously assess the technical impact of the system.
The paper includes some implementation details, such as the use of Max/MSP and the specific setup for the performance, but it lacks comprehensive documentation that would allow for full reproducibility. There are references to figures and technical sheets that are not provided in the text, which would be necessary for others to replicate the setup and results.
One significant limitation is the reliance on a single performance context, which may not generalize to other musical settings or styles. Additionally, the paper acknowledges that the theoretical framework is still a work in progress, indicating that the full potential of the proposed methods has not yet been realized. The use of low-fidelity generative outputs may also limit the appeal to audiences accustomed to high-fidelity music production.
The work has potential implications for the intersection of AI and music, particularly in how generative models can be integrated into traditional music practices without displacing human musicians. By advocating for a more nuanced understanding of AI's role in music creation, the paper contributes to ongoing discussions about the ethical and cultural ramifications of AI in the arts. It also opens avenues for further exploration of AI as a co-creative partner rather than a replacement for human artists.
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, Tsinghua University, Noiz AI, Independent Researcher
The main contribution of this work is the introduction of AudioX-Turbo, a unified framework for efficient anything-to-audio generation that significantly reduces inference costs while maintaining high-quality output across multiple modalities. This work represents a meaningful advancement in the field of audio generation, combining innovative methodologies with practical applications.
The methodology presented in this paper is robust, utilizing a teacher-student paradigm to enhance efficiency in audio generation. The integration of a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module is particularly noteworthy, as it allows for the alignment of diverse multimodal inputs, which is crucial for high-fidelity audio synthesis. The proposed Distribution Matching Distillation method is innovative and effectively reduces the inference cost associated with multi-step diffusion sampling. The two-stage data construction pipeline for creating a large-scale dataset is also a significant contribution, addressing the common issue of data scarcity in multimodal training.
The experiments are comprehensive, benchmarking AudioX-Turbo against state-of-the-art methods across various tasks, including text-to-audio and video-to-audio generation. The results demonstrate that the proposed model achieves superior performance while significantly reducing the number of function evaluations required for inference. The use of both subjective and objective evaluation metrics strengthens the credibility of the findings. However, the paper could benefit from more detailed comparisons with a broader range of existing models to fully contextualize its performance.
The paper provides a clear outline of the implementation details, including architecture specifications, training protocols, and evaluation metrics. The availability of the code and datasets is a positive aspect that enhances reproducibility. However, the paper could improve by including more specific hyperparameter settings and training configurations to facilitate easier replication of results by other researchers.
One limitation is the reliance on a large-scale dataset, which may not be readily available for all researchers. Additionally, while the model shows impressive performance, there may be edge cases or specific scenarios where the model's generalization capabilities could be further tested. The paper does not fully address potential biases in the dataset or the implications of using large-scale models in real-world applications.
The implications of this research are significant, as it opens new avenues for automated audio generation in various fields, including entertainment, gaming, and content creation. The ability to generate high-quality audio from diverse multimodal inputs can enhance user experiences and streamline production processes. Furthermore, the findings may inspire future research in multimodal AI systems and their applications in other domains.
Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and 96% STOI on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
Primary: Thales SIX GTS
All Institutions: Thales SIX GTS
BASENet introduces a frequency-adapted speech enhancement network that effectively allocates encoder depth based on auditory principles, achieving a strong balance between performance and computational efficiency. This work represents a meaningful contribution to the field of audio processing, particularly in enhancing speech intelligibility in challenging acoustic environments.
The methodology presented in BASENet is innovative, leveraging perceptual principles from the human auditory system to inform the architecture's design. The use of Bark-scale bands for frequency adaptation and the introduction of a cross-band attention mechanism are significant advancements. The architecture's ability to dynamically allocate encoder capacity based on critical-band density is a novel approach that addresses limitations in existing models that apply uniform capacity across the frequency spectrum. The integration of causal processing for real-time applications further enhances its applicability in practical scenarios, such as hearing aids.
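The Bark-scale partition can be sketched as follows, using the Zwicker and Terhardt approximation for converting Hz to Bark and assigning each STFT bin to the integer Bark band it falls in. The FFT size, sample rate, and the floor-based banding rule are assumptions for illustration. BASENet's actual band edges and its rule for mapping critical-band density to encoder depth are not reproduced here.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker and Terhardt approximation of the Bark scale."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_index(n_fft=512, sr=16000):
    """Assign every rFFT bin to the integer Bark band containing its centre frequency."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return np.floor(hz_to_bark(freqs)).astype(int)

bands = bark_band_index()
bins_per_band = np.bincount(bands)  # number of STFT bins that land in each Bark band
```

Low bands contain only a handful of linear-frequency bins while high bands contain dozens, which is the non-uniform resolution the architecture is designed around. A per-band encoder would then receive the slice of the spectrogram selected by `bands == b`.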
The experiments are well-structured, utilizing the VoiceBank+DEMAND dataset, which is a standard benchmark for speech enhancement tasks. The reported results demonstrate that BASENet achieves competitive performance metrics, specifically a PESQ score of 3.55 with significantly fewer parameters than comparable models. The ablation studies provide valuable insights into the contributions of various components of the architecture, reinforcing the importance of the proposed methods. However, the paper could benefit from additional comparisons with more recent state-of-the-art models to further contextualize its performance.
The paper provides sufficient implementation details, including architecture specifications, training procedures, and hyperparameters, which facilitate reproducibility. However, the absence of a publicly available code repository or demo limits the ability for independent verification of results. Including such resources would significantly enhance the paper's reproducibility.
One limitation is the lack of a comprehensive evaluation against a wider array of state-of-the-art models, particularly those that utilize more advanced techniques such as self-supervised learning or generative models. Additionally, while the model is lightweight, its performance in extremely noisy conditions or with diverse accents is not thoroughly explored, which could affect its generalizability.
The implications of BASENet are significant, particularly in the fields of assistive technology and real-time communication systems. By improving speech enhancement in resource-constrained environments, this work could enhance accessibility for individuals with hearing impairments and improve clarity in voice communication applications. The model's design principles could also inspire future research in audio processing, particularly in leveraging perceptual characteristics for model optimization.
Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate-distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate-distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely-used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: https://avery-xu.github.io/ECC-demo/
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The main contribution of this paper is the introduction of ECC, a novel Entropy-Constrained Codec that significantly improves low-bitrate speech compression through joint optimization of representation learning and probability modeling. The comprehensive benchmarking and innovative methodologies presented in this work mark a significant advancement in the field of neural speech codecs, addressing critical challenges in efficient speech representation and transmission.
The paper presents a comprehensive methodology for neural speech compression through the proposed Entropy-Constrained Codec (ECC). The methodology integrates scalar quantization with a learned entropy model, emphasizing the importance of joint optimization of representation learning and probability modeling. The use of hyperprior-based side information, channel-wise context modeling, and latent residual prediction demonstrates a sophisticated approach to improve the rate-distortion trade-off. The introduction of entropy skip to omit predictable residual symbols is a noteworthy innovation that enhances efficiency without additional signaling. The unified formulation and benchmarking of existing codecs provide a solid foundation for understanding the advancements in the field.
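The interaction between scalar quantization, a learned entropy model, and entropy skip can be sketched as below. The code assumes a zero-mean Gaussian model for each residual, integrated over the unit-width quantization bin, and a fixed scale threshold for skipping. Both are common choices in learned compression but are assumptions here, as are the names `symbol_bits`, `code_residuals`, and `skip_below`. In ECC the means and scales would come from its hyperprior and context models.

```python
import math

def _std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def symbol_bits(q, sigma):
    """Estimated bits for integer residual q under a zero-mean Gaussian of scale sigma."""
    p = _std_normal_cdf((q + 0.5) / sigma) - _std_normal_cdf((q - 0.5) / sigma)
    return -math.log2(max(p, 1e-9))  # clamp to avoid log(0)

def code_residuals(latents, means, scales, skip_below=0.1):
    """Quantize residuals, estimate the rate, and skip symbols with tiny predicted scale."""
    bits, recon = 0.0, []
    for x, mu, sigma in zip(latents, means, scales):
        if sigma < skip_below:
            # The decoder knows sigma too, so it can infer the skip without a mask
            # and simply uses the predicted mean.
            recon.append(mu)
            continue
        q = round(x - mu)               # scalar quantization of the residual
        bits += symbol_bits(q, sigma)
        recon.append(mu + q)
    return bits, recon

bits, recon = code_residuals(
    latents=[0.02, 1.9, -3.2],
    means=[0.0, 1.0, -1.0],
    scales=[0.05, 1.0, 2.0],
)
```

The first latent is skipped because its predicted scale is below the threshold, so it costs no bits and is reconstructed as its predicted mean. The skip decision depends only on quantities available at the decoder, which is why no side information is needed.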
The experiments conducted are extensive, involving both objective and subjective evaluations across multiple datasets, including LibriTTS and VCTK. The results indicate that ECC significantly outperforms conventional and recent neural codec baselines in terms of BD-rate reductions and perceptual quality metrics. The use of multiple evaluation metrics (ViSQOL, PESQ, STOI, etc.) strengthens the reliability of the findings. Ablation studies further validate the effectiveness of the proposed entropy modeling and architectural choices, showcasing a rigorous experimental design.
The paper provides detailed descriptions of the experimental setup, including datasets, training procedures, and evaluation metrics. However, the lack of a publicly available code repository limits reproducibility. While the methodology is well-documented, the absence of implementation details may pose challenges for other researchers attempting to replicate the results.
Some limitations include the reliance on specific datasets for training and evaluation, which may affect the generalizability of the results to other speech domains or languages. Additionally, the complexity of the ECC model may hinder its deployment in real-time applications due to computational demands. The paper could also benefit from a discussion on the trade-offs between model complexity and performance.
The advancements in neural speech compression presented in this paper have significant implications for low-bitrate communication systems, particularly in mobile and real-time applications. The proposed methods could enhance the quality of speech transmission in constrained environments, benefiting various industries, including telecommunications and streaming services. The focus on entropy-constrained coding could inspire further research into efficient coding strategies across different audio and speech processing tasks.
Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires training an additional student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased during training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To address this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this paper is the introduction of interleaved stacking for SFM distillation, which preserves layer-specific knowledge and enhances training efficiency. This work significantly advances the field of speech processing by providing a novel approach to knowledge distillation that addresses existing limitations and demonstrates strong empirical results.
The paper introduces a novel stacking method termed "interleaved stacking" for the distillation of speech foundation models (SFMs). This approach addresses the limitations of existing stacking methods by maintaining consistent layer positions throughout the training process, which is crucial for preserving layer-specific knowledge in SFMs. The methodology is well-structured, with a clear explanation of how interleaved stacking differs from traditional stacking methods and the rationale behind it. The integration of intermediate-level knowledge distillation losses further enhances the proposed method's effectiveness, demonstrating a thoughtful consideration of the challenges in knowledge distillation.
The experiments are robust, utilizing the SUPERB benchmark to validate the proposed method across various speech processing tasks. The results indicate that interleaved stacking outperforms existing methods significantly, showcasing improvements in performance metrics such as phoneme error rate (PER) and word error rate (WER). The paper also includes a comparative analysis against models trained without stacking, reinforcing the advantages of the proposed approach. However, the paper could benefit from additional details on the experimental setup, such as hyperparameter tuning and the specific configurations used for the training process.
The paper provides a reasonable level of detail regarding the experimental setup, including model architectures, training parameters, and evaluation metrics. However, the lack of a publicly accessible code repository or demo URL limits the reproducibility of the results. Future work should consider making the code available to facilitate validation of the findings by the research community.
One limitation of the study is the potential overfitting to the SUPERB benchmark, which may not fully represent the diversity of real-world speech processing tasks. Additionally, while the proposed method shows significant improvements, it remains to be seen how it performs in more complex scenarios or with different types of speech data. The paper also does not address the computational cost of implementing interleaved stacking compared to traditional methods, which could be a consideration for practical applications.
The proposed method has significant implications for deploying efficient speech processing models in low-resource environments, making it particularly relevant for applications in real-time speech recognition and natural language processing. By improving the training efficiency of SFMs, the research contributes to the broader goal of making advanced machine learning technologies more accessible and practical for various applications.
Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, Pengcheng Laboratory, Independent Researcher
The main contribution of this paper is the development of a feature-aligned watermarking method that effectively balances robustness and imperceptibility in audio watermarking. This work significantly advances the field of audio processing by addressing the critical challenges posed by speech reconstruction models, providing a robust solution that maintains audio quality and imperceptibility.
The proposed methodology introduces a feature-aligned watermarking approach that leverages a pretrained speech codec to generate a pseudo-speech watermark, which is then embedded into the audio spectrogram. The integration of voice activity detection (VAD) loss and perceptual losses is a significant enhancement, allowing the watermark to be embedded within voiced regions, thus maintaining imperceptibility while improving robustness against various speech reconstruction models. The architecture is well-structured, with clear delineation between the embedder and decoder components, and the use of a feature pyramid for watermark extraction is innovative and well-justified.
The experiments are comprehensive, utilizing multiple datasets (VCTK, LibriSpeech, LJSpeech) and a variety of speech reconstruction models to assess robustness. The evaluation metrics, including bit-wise accuracy (ACC) and false attribution rate (FAR), are appropriate for the task. The subjective ABX tests and VISQOL MOS scores provide a solid basis for assessing perceptual quality, demonstrating that the proposed method achieves competitive results compared to existing watermarking techniques. The ablation studies further validate the contributions of specific components of the methodology.
The paper provides sufficient implementation details, including model architecture, training protocols, and loss functions, which facilitate reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers.
One limitation is the potential degradation in fidelity when embedding higher energy watermarks, which, while addressed, may still affect certain applications where audio quality is paramount. Additionally, the method's performance under extreme distortions or aggressive compression could be further explored, as indicated by some performance drops in the experiments.
The proposed watermarking technique has significant implications for copyright protection and content attribution in modern audio applications, especially in environments where speech reconstruction is prevalent, such as voice calls and online meetings. The ability to maintain imperceptibility while enhancing robustness could lead to wider adoption of watermarking technologies in commercial applications.
Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at https://github.com/RSC-Toolkit/Lung-SRAD.
Primary: RSC LAB, MODULABS, Republic of Korea
All Institutions: RSC LAB, MODULABS, Republic of Korea; Department of Electronic Engineering, Wonkwang University, Republic of Korea; AI Convergence Research Institute, Wonkwang University, Republic of Korea
This paper introduces a novel approach to respiratory sound classification using State Space Models, addressing key limitations of existing Transformer-based methods. The technical contributions, including spectral-aware regularization and contrastive learning, are well-founded and demonstrate potential for significant impact in the field of medical audio analysis.
The paper presents a novel approach to respiratory sound classification (RSC) by leveraging State Space Models (SSMs) as an alternative to traditional Transformer architectures. The introduction of spectral-aware layer regularization and Dual-Axis Patch-Mix contrastive learning is well-motivated and addresses specific limitations of existing methods, particularly the low-pass filtering behavior of self-attention mechanisms. The methodology is clearly articulated, with a strong theoretical foundation and empirical validation of the proposed techniques.
The experiments are conducted on the ICBHI benchmark, which is a relevant dataset for RSC. The results demonstrate a clear improvement over the baseline Audio Spectrogram Transformer (AST) model, achieving a score of 64.48%. The paper provides a thorough analysis of the performance metrics, including sensitivity and specificity, which are crucial for medical applications. However, the paper could benefit from additional comparisons with more recent models in the field to further contextualize its contributions.
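For readers unfamiliar with the metric, the ICBHI score is conventionally the arithmetic mean of sensitivity (abnormal cycles classified correctly) and specificity (normal cycles classified correctly). The sketch below assumes the paper follows that convention; the function name `icbhi_score` and the counts are hypothetical and are not taken from the paper.

```python
def icbhi_score(correct_abnormal, total_abnormal, correct_normal, total_normal):
    """Return (sensitivity, specificity, score) with score = (Se + Sp) / 2."""
    sensitivity = correct_abnormal / total_abnormal
    specificity = correct_normal / total_normal
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Hypothetical counts, for illustration only.
se, sp, score = icbhi_score(correct_abnormal=45, total_abnormal=100,
                            correct_normal=160, total_normal=200)
print(f"Se={se:.2%}  Sp={sp:.2%}  Score={score:.2%}")
```

Because the score averages the two rates, a model can reach the same score with very different sensitivity and specificity, which is why reporting both matters for medical use.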
The authors provide sufficient details regarding the experimental setup, including training parameters and evaluation metrics, which enhances reproducibility. The availability of the code on GitHub is a positive aspect, enabling other researchers to replicate the findings.
One limitation of the study is the reliance on a single dataset (ICBHI), which may affect the generalizability of the results. Additionally, while the proposed methods show improvements, the paper does not explore the potential trade-offs in computational efficiency or model complexity compared to existing architectures.
The proposed method has significant implications for the field of respiratory sound classification, which is critical for diagnosing various respiratory conditions. By improving the sensitivity to abnormal lung sounds, this research could enhance clinical decision-making and patient outcomes in respiratory health.
Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.
Primary: National University of Singapore
All Institutions: National University of Singapore, TikTok
The paper presents a novel framework for reproduced content identification that leverages multimodal large language models, demonstrating significant improvements in efficiency and accuracy for real-time applications in content moderation. The technical contributions, particularly in knowledge distillation and multimodal integration, are poised to impact the field of machine learning and content governance significantly.
The methodology presented in this paper is robust, employing a two-stage training framework that effectively utilizes knowledge distillation to transfer the capabilities of a high-capacity multimodal large language model (MLLM) to a lightweight model suitable for real-time applications. The joint modeling of video, audio, and text signals is a significant advancement, as it addresses the limitations of existing methods that primarily focus on visual encoders. The architecture design of both MatchLM and MatchLite is well thought out, allowing for efficient processing and accurate reproduction identification through a multimodal approach.
The experimental evaluation is thorough, with extensive offline experiments and online A/B testing on a large-scale platform. The results demonstrate significant improvements in F1 scores and computational efficiency, validating the effectiveness of the proposed system in a real-world setting. The paper provides detailed comparisons with baseline models and ablation studies that highlight the contributions of different components, enhancing the credibility of the findings.
While the paper provides a comprehensive description of the models and training processes, it lacks specific URLs or repositories for code and data, which could hinder reproducibility. The absence of publicly available benchmarks or datasets also limits the ability of other researchers to replicate the results independently.
One limitation is the reliance on proprietary data and the lack of a publicly available dataset for reproduced content identification, which restricts broader validation of the proposed methods. Additionally, the paper does not address potential biases in the training data or the implications of deploying such a system at scale, which could affect fairness and accuracy in content moderation.
The proposed MatchLM2Lite framework has significant implications for content moderation in online platforms, potentially improving user experiences by reducing the prevalence of reproduced content. The integration of multimodal signals could enhance the understanding of content authenticity, benefiting creators and users alike. However, the ethical considerations surrounding automated content moderation and the potential for overreach in content filtering must be carefully managed.
Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synthesis, also require precise alignment between generation and external signals, both in terms of generation content and timing. We refer to this problem as frame-synchronous streaming inference. To address it, we present StreamMUSE, an inference system that performs LM generation in response to an external signal stream within a client-server architecture. The client continuously sends high-frequency inference requests based on the most recent inputs and receives outputs synchronized to the external clock, while the server executes model inference. We demonstrate the framework through a live music accompaniment task, showing how real-time synchronization can be achieved across different deployment environments with varying round-trip latencies. We further model the relationship between system hyperparameters and round-trip latency, and evaluate how different environments affect optimal configurations to achieve real-time performance. Experimental results show a consistent correspondence between system real-time performance and music quality, demonstrating the effectiveness of the proposed framework. The project is open source. Relevant code and the latest updates are available at https://stream-muse-webpage.vercel.app/#audio-library.
Primary: Mohamed bin Zayed University of Artificial Intelligence
All Institutions: University of Science and Technology of China, Mohamed bin Zayed University of Artificial Intelligence, University of California, San Diego, Wuhan University, New York University
The main contribution of this paper is the introduction of a novel real-time language model inference system tailored for live music accompaniment generation, demonstrating the feasibility of frame-synchronous streaming inference in a client-server architecture. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the potential for impactful advancements in interactive music systems.
The paper presents a novel approach to real-time language model inference in the context of live music accompaniment generation. The authors introduce a client-server architecture that allows for frame-synchronous streaming inference, which is essential for maintaining musical coherence and timing. The methodology is well-structured, focusing on the interplay between inference intervals and generation lengths to optimize responsiveness and quality. The use of a tick-based system for temporal resolution is particularly innovative, allowing for precise synchronization with musical elements. The mathematical modeling of round-trip latency and its impact on system performance is a significant contribution, providing a theoretical foundation for the practical implementation of their system.
The experimental setup is robust, with evaluations conducted across three different environments (local, local-server, and remote-server) to assess the system's performance under varying conditions. The metrics used for evaluation, including Interaction Success Rate (ISR), Staleness, and various music quality metrics (JSD, FMD, CR, UR), are comprehensive and relevant. The results demonstrate a clear correlation between system responsiveness and music quality, validating the proposed framework. However, while the experiments are thorough, additional comparisons with existing state-of-the-art systems could strengthen the claims of superiority.
The paper provides sufficient detail regarding the implementation, including the architecture, training details, and evaluation metrics. The open-source nature of the project, with a dedicated URL for accessing the code and updates, enhances reproducibility. However, the paper could benefit from more explicit instructions or a README file in the repository to facilitate easier replication of the experiments.
One limitation is the reliance on specific datasets (e.g., POP909) for training and evaluation, which may affect the generalizability of the results. Additionally, the system's performance may vary significantly with different musical genres or styles, which is not thoroughly explored in the paper. The impact of network conditions on real-time performance could also be more extensively analyzed, particularly in real-world scenarios.
The proposed system has the potential to revolutionize live music performance by enabling real-time accompaniment generation that is both musically coherent and responsive to live input. This could have significant implications for musicians, educators, and the entertainment industry, enhancing collaborative performances and interactive music experiences. The framework also opens avenues for further research in real-time generative models across various domains beyond music.
Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.
Primary: Xiamen University
All Institutions: Xiamen University, DiDi Global Inc.
The main contribution of this paper is the introduction of SARA, a dual-stream VAE that effectively integrates semantic and acoustic representations for improved zero-shot TTS performance. This innovative approach addresses existing challenges in speech synthesis, offering a promising direction for future research in high-fidelity speech generation.
The paper introduces SARA, a dual-stream variational autoencoder (VAE) that integrates semantic and acoustic representations to address the trade-off between reconstruction fidelity and generative controllability in zero-shot text-to-speech (TTS) systems. The methodology is well-structured, leveraging a frozen self-supervised learning (SSL) model for semantic encoding and a residual acoustic encoder for capturing detailed acoustic features. This architectural innovation allows for efficient integration of both streams without the need for complex regularization, which is a notable improvement over existing methods. The use of adversarial training to enhance perceptual quality further strengthens the approach.
The experimental evaluation is robust, utilizing extensive datasets such as LibriTTS and LibriHeavy for training and testing. The authors provide a comprehensive comparison against strong baselines, demonstrating SARA's superior performance in terms of reconstruction fidelity and downstream TTS tasks. The metrics used, including PESQ, STOI, WER, and subjective evaluations like CMOS and SMOS, are appropriate for assessing both objective and perceptual quality. The results indicate significant improvements in content accuracy and speaker similarity, validating the effectiveness of the proposed framework.
The paper provides detailed implementation specifics, including training configurations, dataset descriptions, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The authors could improve reproducibility by sharing their code and trained models.
One limitation is the reliance on a frozen SSL model, which may limit adaptability to new or diverse datasets. Additionally, while the dual-stream architecture effectively mitigates the semantic-acoustic trade-off, it introduces additional complexity that may not be necessary for all applications. The paper also does not explore the scalability of the model in multilingual settings, which could be a significant area for future research.
The advancements presented in SARA have the potential to significantly enhance the quality of TTS systems, making them more applicable in various domains such as virtual assistants, audiobooks, and accessibility technologies. The ability to generate high-fidelity speech with accurate content representation could improve user experiences and broaden the reach of TTS applications.
Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level score-audio alignment methods such as dynamic time warping provide only coarse correspondence, making a local refinement step necessary. This refinement step, known as snapping, adjusts aligned score onsets using peaks in a neural onset posteriorgram and often determines whether weakly aligned score-audio pairs become usable training data at all. Despite its practical importance, snapping is typically treated as a simple post-processing heuristic and implemented with greedy local decisions. We present a systematic analysis of snapping strategies for training instrument-agnostic transcribers, demonstrating that snapping is essential for learning from weakly aligned data. Building on this, we formulate snapping as a per-pitch assignment problem and solve it via bipartite graph matching, yielding context-aware onset decisions under overlapping refinement windows and uncertain initial alignments. Extensive cross-dataset experiments across piano, chamber, and orchestral recordings show improved onset alignment and transcription accuracy over greedy snapping, with gains increasing for wider snapping windows and coarser initial alignments. Qualitative examples are provided on our project page: https://abhirupsaha8.github.io
Primary: International Audio Laboratories Erlangen
All Institutions: International Audio Laboratories Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Fraunhofer Institute for Integrated Circuits IIS
The main contribution of this paper is the introduction of a graph-based approach to snapping in automatic music transcription, which improves alignment accuracy and transcription performance. This work represents a significant step forward in addressing the challenges of note-onset detection in complex musical recordings, providing a robust framework that can be applied to various instruments and ensemble types.
The paper presents a novel approach to refining note-onset alignments in automatic music transcription by framing the snapping process as a bipartite graph matching problem. This method effectively addresses the limitations of traditional greedy approaches by ensuring global consistency and robustness against overlapping refinement windows. The systematic analysis of snapping strategies and the formulation of the problem demonstrate a clear advancement in the methodology of AMT.
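The difference between greedy snapping and a one-to-one assignment can be seen on a tiny single-pitch example. The sketch below is illustrative only and rests on several assumptions: the cost is plain time distance (the paper's cost presumably also uses posteriorgram peak strength), an onset may stay unsnapped at a cost equal to the window width, and the matching is found by exhaustive search, which is only practical for a handful of notes. A real implementation would more likely use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`. The names `greedy_snap` and `matching_snap` are hypothetical.

```python
from itertools import permutations


def greedy_snap(onsets, peaks, window):
    """Each onset independently takes its nearest peak inside the window."""
    snapped = []
    for t in onsets:
        near = [p for p in peaks if abs(p - t) <= window]
        snapped.append(min(near, key=lambda p: abs(p - t)) if near else t)
    return snapped


def matching_snap(onsets, peaks, window):
    """One-to-one assignment minimizing total |onset - peak| by exhaustive search.

    A None slot means the onset stays where it is, at a cost equal to the window.
    """
    slots = list(peaks) + [None] * len(onsets)
    best_cost, best = float("inf"), None
    for choice in permutations(slots, len(onsets)):
        cost = 0.0
        for t, p in zip(onsets, choice):
            if p is None:
                cost += window
            elif abs(p - t) <= window:
                cost += abs(p - t)
            else:
                cost = float("inf")  # peak outside this onset's window
                break
        if cost < best_cost:
            best_cost, best = cost, choice
    return [t if p is None else p for t, p in zip(onsets, best)]


# Two repeated notes of one pitch whose refinement windows overlap (times in seconds).
onsets = [1.00, 1.10]
peaks = [1.04, 1.30]
print(greedy_snap(onsets, peaks, window=0.25))    # [1.04, 1.04]: both notes collapse
print(matching_snap(onsets, peaks, window=0.25))  # [1.04, 1.3]: each peak used once
```

The greedy variant lets both notes claim the same peak, while the assignment forces the second note onto the later peak, which is the kind of context-aware decision the review attributes to the matching formulation.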
The experiments are comprehensive, utilizing multiple datasets (MusicNet, MAESTRO, URMP, etc.) to validate the proposed method across different musical contexts. The results indicate significant improvements in transcription accuracy and onset alignment, especially under conditions of coarse initial alignments. The evaluation metrics are well-defined, focusing on note-level precision, recall, and F1 scores, which are appropriate for the task.
The paper provides sufficient detail on the methodology and experimental setup, allowing for reproducibility. However, specific implementation details, such as the exact configurations of the bipartite matching algorithms used, could be more thoroughly documented to enhance reproducibility further.
While the proposed method shows promise, it may still struggle with highly complex orchestral pieces where overlapping notes are more frequent. Additionally, the reliance on neural onset posteriorgrams could introduce variability based on the quality of the underlying models used for this task.
The advancements in this paper could significantly enhance the field of music information retrieval, particularly in automatic music transcription, by enabling more accurate and reliable systems. This could have implications for music education, music analysis, and the development of music-related applications that rely on precise note recognition.
Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
Primary: NTT, Inc., Japan
All Institutions: NTT, Inc., Japan
The main contribution of this paper is the introduction of a novel method for generating tight boundary predictions in speaker diarization using causal-anticausal consistency, which significantly improves the performance of models trained on loosely annotated data. The comprehensive analysis of the methodology and experimental results underscores its potential to advance the field of audio processing and speaker diarization.
The proposed methodology effectively addresses the challenge of producing tight predictions from loose annotations in speaker diarization. By leveraging causal and anticausal models, the authors ingeniously isolate the functionalities of detecting speech segments and managing boundary conditions. The co-training scheme is a notable innovation that facilitates progressive refinement of labels, enhancing the model's ability to produce tighter outputs. The approach is well-structured and justified, with a clear rationale for each step in the methodology.
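A frame-level toy example may help make the tightening idea concrete. The review does not give the exact combination rule, so the sketch below assumes a simple logical AND of the loose label with both model outputs. The intuition assumed here is that a causal model, seeing only the past, cannot anticipate an onset and so cannot reproduce the leading margin, while an anticausal model, seeing only the future, cannot reproduce the trailing margin. The function name `tighten` and all frame values are hypothetical.

```python
def tighten(loose, causal_pred, anticausal_pred):
    """Keep a frame as speech only if the loose label and both models agree."""
    return [int(l and c and a)
            for l, c, a in zip(loose, causal_pred, anticausal_pred)]


# One speaker, ten frames. True speech occupies frames 3..6 (hypothetical).
loose      = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # annotation with margins on both sides
causal     = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]  # tight onset, loose offset
anticausal = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # loose onset, tight offset

pseudo = tighten(loose, causal, anticausal)
print(pseudo)  # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
```

In a co-training loop of the kind the review mentions, `pseudo` would then presumably replace `loose` as the training target for the next round, so that both models are retrained on progressively tighter labels.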
The experiments conducted are thorough and demonstrate the effectiveness of the proposed methods across various datasets. The results indicate a significant reduction in diarization error rates (DER) when using the proposed tightening methods compared to baseline models. The inclusion of multiple tightening strategies (basic, VAD, SC) and their comparative analysis adds depth to the evaluation. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component.
The paper provides a detailed description of the experimental setup, including model architectures, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing the implementation to facilitate validation and further research by the community.
The primary limitation identified is the reliance on some ideal tight labels for validation, which may not be feasible in all scenarios. Additionally, the model's performance might be constrained by the size and diversity of the datasets used, and the potential for over-tightening could negatively impact downstream applications. The authors acknowledge these limitations and suggest that future work should explore larger datasets without tight labels.
This research has significant implications for real-world applications in speaker diarization, particularly in scenarios where precise speaker segmentation is critical, such as in automated transcription services and conversational AI systems. By improving the accuracy of diarization models, this work can enhance the quality of multi-speaker audio processing, leading to better user experiences in various applications.
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
Primary: University of Surrey
All Institutions: Hong Kong University of Science and Technology, Tencent, University of Surrey, Chinese University of Hong Kong, Hong Kong Baptist University, Hong Kong Polytechnic University, Independent Researcher
The paper makes a significant contribution by systematically investigating how speech representation design impacts text-native reasoning in LLMs, introducing innovative methodologies that enhance cross-modal understanding. The comprehensive analysis of frame rates and representation alignment provides valuable insights for future research and applications in multimodal AI systems.
The paper presents a novel approach to speech representation design by systematically exploring the impact of frame rates and representation alignment on the reasoning capabilities of frozen LLMs. The introduction of factorized finite scalar quantization (FSQ) and a lightweight non-autoregressive audio LM head is particularly innovative, addressing the information bottleneck at low frame rates. The methodology is well-structured, focusing on controlled experiments that isolate the effects of speech tokenization on performance metrics. The use of contrastive learning for representation alignment across intermediate LLM layers is a significant methodological advancement that enhances the model's ability to bridge the modality gap between speech and text.
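To make the bits-per-frame trade-off concrete, the sketch below shows plain finite scalar quantization and the arithmetic behind a fixed information rate. It is not the paper's factorized variant: the `tanh` bounding, the restriction to odd level counts, the function names, and the 600 bit/s budget are all assumptions chosen for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each dimension of z to one of levels[i] integer codes.

    Assumes odd level counts, so the codes are symmetric around zero.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        # Bound the value to (-half, half), then snap to the nearest integer.
        codes.append(round(math.tanh(value) * half))
    return codes


def bits_per_frame(levels):
    """Capacity of one frame: the sum of log2(levels) over all dimensions."""
    return sum(math.log2(n) for n in levels)


def required_bits_per_frame(budget_bits_per_second, frame_rate_hz):
    """Bits each frame must carry to keep the information rate fixed."""
    return budget_bits_per_second / frame_rate_hz


print(fsq_quantize([0.0, 2.0, -2.0], [5, 5, 5]))
print(bits_per_frame([5, 5, 5]))

# Hypothetical budget: lowering the frame rate raises the per-frame load.
for rate in (50.0, 4.17, 2.08):
    print(rate, required_bits_per_frame(600.0, rate))
```

Under a fixed budget, every halving of the frame rate doubles the bits each frame must carry, which is presumably why a higher-capacity quantizer is needed before very low frame rates become feasible.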
The experiments are comprehensive, utilizing a well-defined dataset (LibriSpeech) and a progressive training pipeline that includes multiple stages (ASR, TTS, and speech QA). The results demonstrate a clear understanding of the relationship between frame rate and model performance, with empirical findings supporting the proposed hypotheses. The U-shaped performance trends observed in ASR and TTS tasks provide valuable insights into the optimal operational regimes for speech QA. However, the reliance on a single dataset may limit the generalizability of the findings.
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which should facilitate reproducibility. However, the lack of publicly available code or data may hinder independent verification of results. The authors acknowledge limitations in generalizing findings beyond the specific datasets used, which is a critical consideration for reproducibility in broader contexts.
The study is limited by its focus on English read speech, which may not generalize to conversational or noisy speech scenarios. The frozen LLM approach may impose a performance ceiling, and the lack of acoustic modeling could restrict the model's applicability. Additionally, the comparison with other methods is constrained by the unavailability of baseline training data and code, making it difficult to assess relative performance comprehensively.
This research has significant implications for the development of multimodal dialogue systems, particularly those that integrate speech and text processing. The findings could inform future designs of speech tokenizers and LLMs, enhancing their reasoning capabilities in real-world applications. The insights regarding frame rate and representation alignment could lead to more efficient and effective speech processing systems, potentially benefiting various industries, including customer service, education, and accessibility technologies.
Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at https://github.com/MSA-LMC/MRAF.
Primary: Hefei University of Technology
All Institutions: Hefei University of Technology, Intelligent Interconnected Systems Laboratory of Anhui Province
The paper presents MRAF, a framework that innovatively addresses the challenges of polyglot speaker identification under incomplete modality conditions. Its contributions lie in the introduction of a learnable missing token and a reliability-aware fusion mechanism, which collectively enhance the model's robustness and accuracy in real-world applications.
The proposed MRAF framework introduces a novel approach to handling missing modalities in polyglot speaker identification by employing a learnable missing token, which enhances the model's ability to generalize across different conditions. The reliability-aware cross-attention fusion module is a significant innovation, allowing the model to dynamically adjust the contribution of each modality based on their estimated reliability. This is a sophisticated method that improves robustness and performance in challenging scenarios, particularly when one modality is missing.
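The weighting step described above can be sketched in a few lines. This is a minimal illustration, assuming softmax as the normalization and treating the reliability scores as given inputs (in the paper they are estimated by the model); the learnable missing token is stood in for by a fixed vector, and the function names are hypothetical. The bidirectional cross-attention that follows the weighting is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_modalities(face_token, audio_token, face_score, audio_score, missing_token):
    """Return reliability-weighted face and audio tokens plus the weights.

    face_token=None marks a missing face; the missing token is used instead
    of a zero vector (learnable in the paper, a fixed vector in this sketch).
    """
    if face_token is None:
        face_token = missing_token
    w_face, w_audio = softmax([face_score, audio_score])
    weighted_face = [w_face * x for x in face_token]
    weighted_audio = [w_audio * x for x in audio_token]
    return weighted_face, weighted_audio, (w_face, w_audio)


# Hypothetical missing-face case in which audio is scored as more reliable.
face, audio, weights = weight_modalities(None, [2.0, 4.0], -1.0, 1.0, [1.0, 1.0])
print(weights)
```

Because the missing token passes through the same weighting path as a real face token, the downstream fusion sees a consistent input format whether or not the face is present.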
The experiments conducted on the POLY-SIM 2026 test set demonstrate the effectiveness of MRAF, achieving impressive accuracy metrics, particularly in complete-modality settings. The results are well-presented, with clear comparisons to baseline methods and ablation studies that validate the contributions of different components of the model. The use of a diverse dataset with real-world variations adds credibility to the findings.
The paper provides sufficient details regarding the experimental setup, including model architecture, training parameters, and evaluation protocols. However, the lack of a demo or interactive implementation limits the ease of reproducibility for external researchers. The authors do mention that the source code will be available, which is a positive aspect for future validation.
While the model shows strong performance, it may struggle in scenarios with significant noise or low-quality inputs, as noted in the limitations section. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other contexts or languages not represented in the training data.
The advancements in multimodal speaker identification have significant implications for applications in biometric authentication, multimedia retrieval, and human-computer interaction. The ability to effectively handle missing modalities could enhance the robustness of systems in real-world applications, making them more reliable in diverse environments.
We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at https://github.com/RSC-Toolkit/QLung.
Primary: RSC LAB, MODULABS, Republic of Korea
All Institutions: RSC LAB, MODULABS, Republic of Korea; Department of Electronic Engineering, Wonkwang University, Republic of Korea; AI Convergence Research Institute, Wonkwang University, Republic of Korea
The paper presents QLung, a novel framework for respiratory sound classification that addresses quality variability and class imbalance through innovative angular-margin learning techniques. The comprehensive methodology and rigorous experimental validation position this work as a meaningful contribution to the field of machine learning in audio processing.
The proposed QLung framework introduces a quality-adaptive angular-margin learning approach that effectively addresses the challenges of low-quality recordings and class imbalance in respiratory sound classification. The dual-factor angular margin formulation is innovative, combining a no-reference audio quality margin with a log-scaled class imbalance margin, which enhances the model's ability to learn discriminative features. The use of an angular classifier that normalizes features and class weights is a significant methodological contribution, ensuring that the model focuses on angular similarity rather than feature magnitudes. This is particularly relevant in the context of respiratory sounds, where subtle differences are crucial for accurate classification.
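A rough sketch of an angular classifier with an adaptive margin follows. The exact formulas are not given in this summary, so everything specific here is an assumption: the additive angular margin on the target class, the `log(1 + max_count / class_count)` form of the class-imbalance term, the choice to multiply it by a quality score in [0, 1], and all function names. How the quality score is derived from spectral entropy and RMS energy is left out.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_scaled_margin(class_count, max_count, base_margin=0.1):
    """Hypothetical imbalance term: rarer classes get a larger margin."""
    return base_margin * math.log(1.0 + max_count / class_count)


def angular_logits(feature, class_weights, target, margin, scale=30.0):
    """Cosine logits on the unit hypersphere with a margin on the target class."""
    f = l2_normalize(feature)
    logits = []
    for idx, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(f, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if idx == target:
            theta = math.acos(cos)
            cos = math.cos(min(math.pi, theta + margin))
        logits.append(scale * cos)
    return logits


# Hypothetical sample: quality score 0.8, minority class with 50 of 500 examples.
margin = 0.8 * log_scaled_margin(class_count=50, max_count=500)
print(angular_logits([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], target=0, margin=margin))
```

Normalizing both features and class weights means the logits depend only on angles, so the margin has the same geometric meaning for every sample regardless of feature magnitude.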
The experiments are well-structured, utilizing the ICBHI and SPRSound datasets, which are standard benchmarks in the field. The reported improvements of 2.46% over the baseline and superior out-of-distribution performance demonstrate the effectiveness of the proposed method. The ablation studies provide insights into the contributions of each component of the QLung framework, reinforcing the robustness of the findings. However, additional comparisons with more recent methods could further validate the claims.
The paper provides sufficient implementation details, including the architecture, training parameters, and data preprocessing steps, which facilitate reproducibility. The availability of the code on GitHub enhances transparency and allows other researchers to replicate the results.
While the approach shows promising results, the reliance on the quality of the audio recordings remains a potential limitation. The method may not generalize well to datasets with significantly different characteristics or noise profiles. Additionally, the performance metrics could benefit from further exploration of other evaluation criteria beyond specificity and sensitivity.
The QLung framework has significant implications for clinical applications, particularly in the diagnosis of respiratory diseases where accurate classification of lung sounds is critical. By improving model robustness against low-quality recordings and class imbalance, this research could enhance the reliability of automated diagnostic tools in healthcare settings. Moreover, the methodologies developed could inspire further research in other audio classification tasks facing similar challenges.
Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.
Primary: The University of Texas at San Antonio
All Institutions: The University of Texas at San Antonio, Texas A&M University
The main contribution of this paper is the introduction of NeurMLLM, a multimodal generative framework that effectively integrates acoustic features, transcripts, and demographic context for the fine-grained staging of neurodegenerative diseases. This innovative approach not only outperforms existing methods but also highlights the potential of multimodal LLMs in clinical applications, paving the way for future advancements in the field.
The methodology presented in this paper is innovative, particularly in its integration of multimodal data (acoustic features, transcripts, and demographic information) within a unified framework. The use of vision transformers for encoding audio data and the instruction-tuning of a large language model (LLM) through Low-Rank Adaptation (LoRA) is a notable advancement. The generative classification approach, which reformulates the task as constrained label-token generation, is a significant departure from traditional classification methods, allowing for better alignment of multimodal evidence with clinical stages.
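Constrained label-token generation can be illustrated without any model: given next-token logits over the vocabulary, only the entries that correspond to valid stage labels are compared. The sketch below is a simplified illustration; the label names, token ids, and logits are hypothetical, and the paper's actual prompt format and decoding code are not shown here.

```python
import math


def constrained_label_probs(logits, label_token_ids):
    """Softmax restricted to label tokens; other vocabulary entries are ignored."""
    picked = {name: logits[tok] for name, tok in label_token_ids.items()}
    peak = max(picked.values())
    exps = {name: math.exp(v - peak) for name, v in picked.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def predict_label(logits, label_token_ids):
    """Pick the most probable label among the allowed label tokens."""
    probs = constrained_label_probs(logits, label_token_ids)
    return max(probs, key=probs.get)


# Hypothetical 6-token vocabulary in which token 0 is not a stage label.
logits = [9.0, 0.1, 1.5, 0.3, 2.5, 0.2]
label_ids = {"mild": 2, "moderate": 4, "severe": 5}

print(predict_label(logits, label_ids))  # -> moderate
```

Although token 0 has the highest raw logit, it is never a candidate, so the output is always a valid stage. The restricted probabilities could also serve as class scores for threshold-free metrics such as AUROC.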
The experiments are comprehensive and well-structured, utilizing the Bridge2AI-Voice dataset for evaluating the proposed NeurMLLM framework. The results demonstrate a clear performance advantage over classical machine learning methods and existing LLM-based approaches, showcasing the effectiveness of the proposed multimodal architecture. The evaluation metrics, including macro-AUROC, accuracy, macro-F1, and macro-recall, are appropriate for the task and provide a robust assessment of model performance.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation protocols, which would allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the results.
The study acknowledges limitations such as the small cohort size, which may introduce performance variance and necessitates validation on larger datasets. Additionally, while the constrained label-token generation approach shows promise, the underlying mechanisms contributing to its effectiveness require further exploration. The reliance on specific LLM backbones may also limit the generalizability of the findings.
The proposed framework has significant implications for the field of neurodegenerative disease screening, offering a scalable and non-invasive method for early detection. By leveraging voice-based biomarkers, this approach could enhance accessibility to diagnostic tools and improve patient outcomes. The integration of multimodal data also opens avenues for further research into personalized medicine and targeted interventions.
Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. Using controlled FOA scenes with increasing scene complexity, we define three desiderata for metric behavior: Responsiveness, Smoothness, and Symmetry. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is the first step towards investigating the sensitivity of metrics for generative spatial audio.
Primary: New York University
All Institutions: New York University, Sony Group Corporation
This paper presents a pioneering framework for evaluating generative spatial audio metrics, addressing a critical gap in the understanding of metric sensitivity. The comprehensive methodology and experimental design contribute valuable insights into the performance of various metrics, paving the way for future advancements in the field.
The paper introduces a novel framework for sensitivity analysis of generative spatial audio metrics, focusing on three key desiderata: Responsiveness, Smoothness, and Symmetry. The methodology is well-structured, employing a systematic approach to evaluate the performance of various metrics under controlled spatial parameter changes. The use of a custom dataset with increasing scene complexity and the definition of clear metrics for evaluation are commendable. However, the reliance on synthetic data may limit the generalizability of the findings.
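The three desiderata can be made concrete with simple proxies computed on metric values sampled along a trajectory. The definitions below are illustrative assumptions, not the paper's formulas: responsiveness as the fraction of steps where the metric increases with spatial offset, roughness (the inverse of smoothness) as the mean absolute second difference, and asymmetry as the mean gap between mirrored offsets.

```python
def responsiveness(values):
    """Fraction of adjacent steps where the metric strictly increases."""
    steps = list(zip(values, values[1:]))
    return sum(1 for a, b in steps if b > a) / len(steps)


def roughness(values):
    """Mean absolute second difference; 0.0 means a perfectly linear curve."""
    second = [values[i - 1] - 2 * values[i] + values[i + 1]
              for i in range(1, len(values) - 1)]
    return sum(abs(d) for d in second) / len(second)


def asymmetry(pos_values, neg_values):
    """Mean absolute gap between the metric at +offset and at -offset."""
    gaps = [abs(p - n) for p, n in zip(pos_values, neg_values)]
    return sum(gaps) / len(gaps)


# Hypothetical metric values at azimuth offsets 0, 10, 20, 30 degrees,
# measured separately toward the right (pos) and toward the left (neg).
pos = [0.0, 1.0, 2.0, 3.0]
neg = [0.0, 1.0, 2.0, 3.5]

print(responsiveness(pos), roughness(pos), asymmetry(pos, neg))
```

A metric that scores well under these proxies would rise steadily as the source moves away from the reference, without jumps, and by the same amount in either direction.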
The experiments are comprehensive, involving a large dataset of 68,400 samples across various conditions, including clean and noisy environments. The evaluation of multiple metrics provides a robust comparison, and the results are clearly presented. The analysis of how metrics respond to different complexities and noise conditions is insightful, although the paper could benefit from more detailed statistical analysis to support the claims made.
The paper includes a GitHub repository for the project, which is a positive aspect for reproducibility. However, the specifics of the implementation details, such as the exact configurations used for the experiments and the datasets, could be more thoroughly documented to enhance reproducibility.
One significant limitation is the focus on artificially synthesized FOA data, which may not fully capture the complexities of real-world audio scenarios. Additionally, the study is limited to a small set of metrics, and future work is needed to expand the framework to include a broader range of evaluation metrics and real-world data.
The findings of this study have the potential to significantly impact the field of spatial audio generation by providing a clearer understanding of how different metrics behave under varying conditions. This could lead to improved evaluation standards and methodologies in the development of generative audio models, ultimately enhancing the quality of immersive audio experiences.
This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. Finer-grained mixed-sparsity pruning, which varies the number of parameter clusters per layer, is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, National Research Council Canada
The paper presents a novel data-free and training-free compression method for speech foundation models through parameter clustering. This approach significantly enhances the efficiency of ASR systems while maintaining competitive performance, marking a meaningful advancement in the field of machine learning and speech technology.
The proposed methodology introduces a novel parameter clustering technique that diverges from traditional pruning methods, focusing on data-free and training-free compression. The use of k-means clustering to group similar parameters is innovative, particularly in the context of speech foundation models. The mixed sparsity approach, which assigns varying numbers of clusters based on layer-wise variance, adds an additional layer of sophistication that enhances the model's adaptability and performance. However, the paper could benefit from a more detailed explanation of the clustering process and its implications on model interpretability.
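Since the clustering procedure is only summarized here, the following is a sketch of one possible reading: each channel is treated as a row vector of a weight matrix, rows are grouped with plain Lloyd's k-means, and every row is replaced by its cluster centroid so that only `k` distinct rows need to be stored. The initialization from the first `k` rows, the function names, and the toy matrix are assumptions made for illustration. No data or gradients are involved, which matches the data-free, training-free setting.

```python
def kmeans_rows(rows, k, iterations=20):
    """Lloyd's k-means over row vectors, initialised from the first k rows."""
    centroids = [list(r) for r in rows[:k]]
    assign = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            assign[i] = dists.index(min(dists))
        # Update step: mean of each cluster (empty clusters keep their centroid).
        for j in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def compress(rows, k):
    """Replace every channel with its centroid; only k distinct rows remain."""
    centroids, assign = kmeans_rows(rows, k)
    return [centroids[a] for a in assign]


# Hypothetical 4-channel weight matrix compressed to 2 clusters.
weights = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [4.8, 5.2]]
centroids, assign = kmeans_rows(weights, k=2)
print(assign)  # -> [0, 1, 0, 1]
print(compress(weights, k=2))
```

Under this reading, the mixed-sparsity variant would amount to choosing a different `k` for each layer rather than a single global value.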
The experiments conducted on the LibriSpeech dataset are robust, demonstrating significant improvements in word error rates (WER) compared to magnitude-based pruning. The results indicate that the proposed method not only maintains performance but also achieves notable reductions in WER across different sparsity levels. The fine-tuning process is well-structured, although the paper could provide more clarity on the specific configurations and hyperparameters used during fine-tuning for reproducibility.
While the paper outlines the experimental setup and the models used, it lacks specific implementation details such as code availability or a clear description of the clustering algorithm's parameters. This omission could hinder reproducibility for other researchers attempting to validate the findings or build upon the work.
One limitation of the approach is its reliance on the assumption that similar parameters can be effectively clustered without significant loss of information. This may not hold true for all model architectures or datasets. Additionally, while the results are promising, the paper does not explore the long-term effects of the proposed compression on model performance in diverse real-world applications.
The implications of this research are significant, particularly for deploying speech models in resource-constrained environments, such as mobile devices. By enabling efficient model compression without the need for extensive data or training, this work could facilitate broader accessibility and usability of advanced speech recognition technologies in everyday applications.