Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that increase computational and deployment costs while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training overhead. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model with substantially lower training and deployment costs. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
Primary: Imperial College London
All Institutions: Imperial College London
The main contribution of this paper is the development of Omni-AVSR, a unified audio-visual LLM that effectively integrates multiple speech recognition modalities while optimizing for resource efficiency. This work significantly advances the field by addressing the limitations of existing models and proposing a novel methodology that balances performance and efficiency in multimodal speech recognition tasks.
The paper introduces a unified framework, Omni-AVSR, which leverages a matryoshka representation learning paradigm to facilitate multi-granularity training across ASR, VSR, and AVSR tasks. This approach is innovative as it aims to reduce resource consumption while maintaining performance, addressing a significant gap in the current literature where these tasks are treated independently. The use of LoRA-based strategies for parameter-efficient adaptation is also a notable contribution, allowing for shared and task-specific model specialization.
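To make the adaptation strategy concrete, here is a minimal, self-contained sketch (not the authors' implementation) of a LoRA adapter attached to a frozen projection layer, plus a small per-task adapter bank that illustrates the shared-versus-task-specific trade-off discussed above; all module names, ranks, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a low-rank trainable update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # backbone weights stay frozen
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)   # start as a no-op w.r.t. the base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

class TaskLoRABank(nn.Module):
    """One adapter per task (ASR / VSR / AVSR) sharing the same frozen base layer."""
    def __init__(self, base: nn.Linear, tasks=("asr", "vsr", "avsr"), rank: int = 8):
        super().__init__()
        self.adapters = nn.ModuleDict({t: LoRALinear(base, rank) for t in tasks})

    def forward(self, x, task: str):
        return self.adapters[task](x)

# Usage: route each batch through the adapter matching its task label.
layer = TaskLoRABank(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 16, 1024), task="avsr")
```

Removing the ModuleDict and keeping a single adapter recovers the fully shared variant, which is the other end of the design space the review refers to.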
The experiments conducted on the LRS2 and LRS3 datasets are comprehensive, demonstrating that Omni-AVSR achieves comparable or superior accuracy to existing state-of-the-art models while significantly lowering resource use. The robustness of the model under acoustic noise and the analysis of scaling behavior as LLM size increases provide valuable insights into performance-efficiency trade-offs, enhancing the credibility of the results.
The paper provides a GitHub repository link for code access, which is essential for reproducibility. However, further details on the experimental setup, hyperparameters, and specific configurations used in the training process would enhance reproducibility.
One limitation is the reliance on fixed-rate token compression, which may still impose constraints on flexibility despite the proposed improvements. Additionally, the paper does not extensively discuss the potential challenges in deploying the unified model in real-world scenarios, particularly in diverse acoustic environments.
The unified approach to multimodal speech recognition has the potential to streamline applications in various fields, including human-computer interaction, accessibility technologies, and automated transcription services. By reducing resource requirements, it could lead to more efficient deployment of speech recognition systems in resource-constrained environments.
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address production large language model (LLM)-based speech synthesis. While previous studies have considered protection against fine-tuning-based synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems that leverage automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, such E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to ensure the imperceptibility of the perturbations. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
Primary: The University of Hong Kong
All Institutions: The University of Hong Kong, Beijing University of Posts and Telecommunications, CSIRO’s Data61, National University of Singapore, Responsible AI Research (RAIR) Centre, Shenzhen International Graduate School, The University of Adelaide, Tsinghua University
The main contribution of this paper is the introduction of E2E-VGuard, a novel framework that proactively defends against adversarial attacks in production LLM-based speech synthesis, addressing both timbre and pronunciation vulnerabilities. This work is a substantial step forward in securing speech synthesis technologies, with implications for both academic research and practical applications in the field.
The methodology presented in E2E-VGuard is innovative, utilizing an encoder ensemble with a feature extractor to enhance the protection of timbre in speech synthesis. The incorporation of a psychoacoustic model to ensure perturbative imperceptibility is a significant advancement, addressing the challenge of maintaining audio quality while implementing security measures. The approach to countering ASR-targeted adversarial examples is particularly noteworthy, as it reflects a deep understanding of the vulnerabilities in current systems.
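As a rough illustration of the ensemble-based protection idea (not E2E-VGuard's actual code), the sketch below optimizes a small perturbation that pushes several speaker-embedding encoders away from the clean timbre embedding; an L-infinity bound stands in for the paper's psychoacoustic masking, and the toy encoders, step counts, and budget are placeholder assumptions.

```python
import torch

def protective_perturbation(wave, encoders, steps=50, eps=0.002, lr=1e-3):
    """PGD-style protective perturbation: push an ensemble of speaker encoders
    away from the clean embedding. eps bounds the L-infinity norm as a crude
    imperceptibility proxy (the paper uses a psychoacoustic model instead)."""
    with torch.no_grad():
        targets = [enc(wave) for enc in encoders]            # clean embeddings
    delta = (0.001 * torch.randn_like(wave)).requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for enc, tgt in zip(encoders, targets):
            emb = enc(wave + delta)
            # minimizing this loss maximizes the distance to the clean timbre
            loss = loss - (1 - torch.cosine_similarity(emb, tgt, dim=-1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                          # keep the perturbation small
    return (wave + delta).detach()

# Toy usage with linear layers standing in for real speaker-embedding encoders.
wave = torch.randn(1, 16000)
toy_encoders = [torch.nn.Linear(16000, 192) for _ in range(3)]
protected = protective_perturbation(wave, toy_encoders, steps=5)
```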
The experimental evaluation is robust, involving 16 open-source synthesizers and 3 commercial APIs across diverse datasets in both Chinese and English. This breadth of testing enhances the credibility of the findings. The results demonstrate E2E-VGuard's effectiveness in protecting against both timbre and pronunciation attacks, which is critical for real-world applications. However, the paper could benefit from a more detailed discussion of the metrics used to evaluate effectiveness.
The authors provide a demo page and mention that the code is available, which supports reproducibility. However, the paper could improve by including more detailed implementation instructions and specific configurations used during experiments to facilitate independent verification of results.
One limitation is the potential dependency on the quality of the ASR systems used, as variations in ASR performance could impact the effectiveness of the proposed defenses. Additionally, the paper does not extensively discuss the scalability of the proposed solution in real-world scenarios or its performance against more sophisticated adversarial attacks.
The implications of this research are significant, as it addresses critical security concerns in speech synthesis technology, which is increasingly used in various applications, including virtual assistants and automated customer service. By providing a proactive defense framework, E2E-VGuard has the potential to enhance trust in voice-based systems and mitigate risks associated with voice-cloning fraud.
This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, through rank analysis we compare the rank characteristics of the Mel-spectrum with those of other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy and a convolutional-style attention module with large kernels is employed for efficient T-F contextual modeling. To enable single-step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and performance is further improved by combining target-related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out-of-distribution datasets. Quantitative and qualitative results show that, while enjoying fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC yields state-of-the-art performance over existing advanced GAN-, DDPM-, and flow-matching-based baselines with only 4 sampling steps. Consistent superiority is maintained even with single-step inference.
Primary: Chinese Academy of Sciences
All Institutions: Chinese Academy of Sciences, University of Chinese Academy of Sciences, Tsinghua University, Tencent AI Lab
This paper presents a significant contribution to the field of neural vocoders by framing the vocoder task as an audio restoration problem and introducing innovative methodologies that enhance performance and efficiency. The comprehensive experiments and results validate the effectiveness of the proposed approach, making it a valuable addition to the literature on audio processing.
The paper introduces a novel approach to neural vocoding by framing it as an audio restoration problem, which is a significant shift from traditional methods. The use of the Schrödinger bridge framework for diffusion modeling is innovative and provides a fresh perspective on the vocoder task. The proposed subband-aware convolutional diffusion network (BridgeVoC) effectively leverages hierarchical prior knowledge in the time-frequency domain, which enhances the model's ability to reconstruct audio waveforms. The introduction of an omnidirectional distillation loss for single-step inference is also a noteworthy contribution, as it addresses the common challenge of information transfer in model distillation.
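A toy sketch of the uneven subband idea follows: narrower bands are allocated to low frequencies, where harmonic detail is dense, and wider bands higher up; the boundary indices are arbitrary assumptions rather than the paper's configuration.

```python
import torch

def split_uneven_subbands(spec, boundaries=(32, 80, 160, 257)):
    """Split a (batch, freq, time) spectrogram into uneven frequency subbands.
    Narrow low-frequency bands preserve harmonic detail, while wider high bands
    keep the band count small. Boundary choices here are illustrative only."""
    bands, start = [], 0
    for end in boundaries:
        bands.append(spec[:, start:end, :])
        start = end
    return bands

spec = torch.randn(1, 257, 400)          # e.g. magnitude of a 512-point STFT
for i, band in enumerate(split_uneven_subbands(spec)):
    print(f"band {i}: {tuple(band.shape)}")
```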
The experiments conducted are comprehensive, utilizing various benchmarks and out-of-distribution datasets, which strengthens the validity of the results. The quantitative and qualitative analyses demonstrate that BridgeVoC achieves state-of-the-art performance with fewer parameters and lower computational costs compared to existing models. The paper provides detailed metrics, including MCD, PESQ, and VISQOL, which are essential for evaluating audio quality.
The paper includes a demo link and mentions the availability of code, which is crucial for reproducibility. However, the details regarding the implementation, such as hyperparameters and training configurations, could be more explicitly stated to facilitate easier replication of the results by other researchers.
One limitation is the reliance on specific datasets for training and evaluation, which may affect the generalizability of the model. Additionally, while the single-step inference method shows promise, it may not achieve the same quality as multi-step methods in all scenarios, particularly in complex audio environments.
The proposed method has significant implications for real-time audio processing applications, such as speech synthesis and enhancement, where computational efficiency and audio quality are critical. By addressing the performance-inference dilemma, this work could lead to advancements in various fields, including telecommunications, entertainment, and assistive technologies.
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
Primary: University of Texas at Austin
All Institutions: University of Texas at Austin
VoiceCraft-X presents a significant advancement in multilingual speech synthesis and editing by unifying these tasks within a single autoregressive framework. The model's innovative approach and robust performance across multiple languages underscore its potential to impact real-world applications in speech technology.
The methodology presented in VoiceCraft-X is innovative, leveraging a unified autoregressive approach to handle both multilingual speech editing and zero-shot TTS synthesis. The use of the Qwen3 model for phoneme-free text processing and the novel token reordering mechanism that aligns text and speech tokens in a single sequence generation task are particularly noteworthy. This approach simplifies the pipeline and enhances the model's ability to generate coherent and natural-sounding speech across multiple languages. The integration of speaker embeddings and the autoregressive prediction of audio tokens further contribute to the model's robustness. However, while the methodology is sound, it builds upon existing techniques in the field without introducing radical new concepts.
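The token-reordering concept can be illustrated with a toy sketch (an assumption-laden illustration, not VoiceCraft-X's actual implementation): the span to be edited is moved to the end of the sequence so a left-to-right decoder conditions on all surrounding context before generating it.

```python
def reorder_for_edit(tokens, edit_start, edit_end, mask_id=-1):
    """Move the edit span to the sequence end so an autoregressive decoder
    sees the full surrounding context before generating the edited region.
    Returns (conditioning_sequence, original_span). Illustrative only."""
    prefix = tokens[:edit_start]
    span = tokens[edit_start:edit_end]
    suffix = tokens[edit_end:]
    conditioning = prefix + [mask_id] + suffix   # mask marks where the span belongs
    return conditioning, span

tokens = list(range(10))                         # stand-in for codec/text tokens
ctx, target = reorder_for_edit(tokens, 3, 6)
print(ctx)     # [0, 1, 2, -1, 6, 7, 8, 9]
print(target)  # [3, 4, 5]
```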
The experimental evaluation is thorough, utilizing a diverse dataset of approximately 32K hours across 11 languages, which is a significant achievement. The authors conducted both subjective and objective evaluations, comparing their model against several state-of-the-art baselines. The results demonstrate competitive performance, particularly in English and other European languages, while also highlighting the model's capability in lower-resource languages. The inclusion of detailed metrics such as WER, SIM-o, CMOS, and subjective scores provides a comprehensive view of the model's effectiveness. However, the reliance on a limited dataset for some languages may affect the generalizability of the results.
The paper provides a substantial amount of detail regarding the training process, model architecture, and evaluation metrics, which is essential for reproducibility. The authors mention their use of specific datasets, training configurations, and evaluation methodologies, which aids in replicating the experiments. However, the lack of a publicly available code repository at the time of publication may hinder full reproducibility for some researchers.
The authors acknowledge several limitations, including the relatively small scale of their training data compared to state-of-the-art models, particularly for lower-resource languages. This limitation may restrict the model's ability to capture the full range of linguistic nuances. Additionally, the current implementation only supports 11 languages, which is a fraction of the global linguistic diversity. The authors also note the ethical implications of their work, particularly concerning the potential for misuse of the technology in creating deepfakes or unauthorized voice cloning.
The development of VoiceCraft-X has significant implications for various applications, including voice assistants, content dubbing, and accessibility tools. The model's ability to perform high-quality speech synthesis and editing across multiple languages can enhance user experiences in diverse linguistic contexts. However, the ethical concerns surrounding the misuse of such technology highlight the need for responsible deployment and the establishment of safeguards to prevent malicious applications. The authors' commitment to a responsible release and advocacy for community-driven safety measures is commendable.
Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MF-Speech, a framework that effectively disentangles speech factors and achieves fine-grained, compositional control in speech generation. This work significantly advances the state-of-the-art in controllable speech synthesis, addressing key challenges in the field and providing a robust foundation for future research and applications.
The proposed MF-Speech framework presents a novel approach to disentangling speech factors—content, timbre, and emotion—through a multi-objective optimization strategy. The architecture of MF-SpeechEncoder and MF-SpeechGenerator is well-structured, with specific modules designed for each factor, enhancing the purity and independence of representations. The use of dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN) for fine-grained control is innovative, allowing for a more nuanced synthesis of speech that can adaptively modulate style and content. However, the complexity of the model may pose challenges in terms of implementation and understanding.
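A compact sketch of the style-adaptive normalization idea (an AdaIN-like layer, not the authors' HSAN code) is shown below: a style vector predicts per-channel scale and shift that modulate normalized hidden features, and stacking such layers at several depths yields a simple hierarchical form of conditioning; module names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from a style embedding."""
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma_beta = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x, style):
        # x: (batch, time, hidden), style: (batch, style_dim)
        gamma, beta = self.to_gamma_beta(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# Stacking such layers at different depths, each fed a (possibly different)
# style code, gives a simple hierarchical form of style conditioning.
layer = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
y = layer(torch.randn(4, 100, 256), torch.randn(4, 128))
```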
The experiments conducted are comprehensive, comparing MF-Speech against multiple state-of-the-art methods across various metrics. The results show significant improvements in word error rate (WER), style control, and subjective evaluation scores, indicating the effectiveness of the proposed framework. The use of both objective and subjective evaluation methods strengthens the reliability of the findings. However, the paper could benefit from additional details on the datasets used and the specific configurations for the experiments.
The paper provides a reasonable level of detail regarding the architecture and training process of MF-Speech. However, it lacks specific implementation details such as hyperparameter settings and the exact training procedure, which could hinder reproducibility. The inclusion of a demo URL is a positive aspect, as it allows for practical evaluation of the model's capabilities.
One limitation is the potential for overfitting due to the complexity of the model and the reliance on a specific dataset (ESD dataset). Additionally, while the framework shows promise in generating expressive speech, the generalization to diverse speakers and emotions remains to be fully validated. The subjective evaluation, while valuable, is based on a limited number of participants, which may not capture the full range of user experiences.
The advancements presented in MF-Speech have significant implications for applications in voice synthesis, personalized digital assistants, and media production. By enabling fine-grained control over speech attributes, the framework can enhance user interaction and experience in various domains, including entertainment, accessibility, and education. The potential for transferability of the learned factors also opens avenues for further research and application in related fields.
Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the MF-Speech framework, which effectively disentangles speech factors and achieves fine-grained control in speech generation. This work significantly advances the field of generative speech synthesis by addressing critical challenges and demonstrating superior performance in empirical evaluations.
The proposed MF-Speech framework presents a robust methodology for disentangling speech factors (content, timbre, and emotion) through a multi-objective optimization strategy. The architecture of the MF-SpeechEncoder, with its three-stream design, effectively addresses the challenge of factor entanglement by ensuring high purity and independence of representations. The MF-SpeechGenerator enhances control granularity through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN), allowing for fine-grained compositional control over generated speech. This dual-component approach is innovative and addresses significant challenges in speech generation.
The experiments conducted are extensive and well-structured, demonstrating the effectiveness of MF-Speech against state-of-the-art methods. The use of both subjective and objective evaluation metrics provides a comprehensive assessment of the framework's performance. The results indicate that MF-Speech outperforms existing methods in key areas such as word error rate (WER), style control, and subjective evaluation scores, validating the proposed methodology's effectiveness.
The paper provides detailed descriptions of the experimental setup, including datasets, training details, and evaluation metrics. However, the lack of a publicly available code repository limits reproducibility. While the methodology is clearly articulated, access to the implementation would enhance the ability of other researchers to replicate the results.
One limitation is the absence of a diverse range of datasets for training and evaluation, which may affect the generalizability of the model across different speech contexts and languages. Additionally, while the framework shows promise in compositional control, the subjective evaluations indicate that there may still be room for improvement in naturalness and expressiveness.
The MF-Speech framework has significant implications for various applications, including virtual assistants, personalized voice synthesis, and media content creation. By enabling fine-grained control over speech characteristics, it can enhance user experience in interactive systems and contribute to advancements in human-computer interaction.
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, Singapore Institute of Technology
The main contribution of this paper is the introduction of CLARITY, a novel framework for mitigating accent and linguistic biases in text-to-speech generation, leveraging dual-signal optimization through contextual linguistic adaptation and accent retrieval. This work significantly advances the field of TTS by addressing critical biases that affect speech authenticity and user experience, showcasing a rigorous methodology and comprehensive evaluation.
The methodology presented in CLARITY is innovative, leveraging large language models (LLMs) for contextual linguistic adaptation and accent retrieval. The dual-signal optimization approach effectively addresses accent and linguistic biases in TTS systems, which are often overlooked in existing literature. The framework's backbone-agnostic nature allows it to be integrated with various TTS models, enhancing its applicability. The use of LLMs to parse user instructions and adapt text for specific dialects demonstrates a strong understanding of sociolinguistic factors, making the approach both practical and theoretically sound.
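A minimal sketch of the retrieval step (illustrative only, not CLARITY's pipeline): reference utterances are ranked by cosine similarity between their accent embeddings and a target-accent query; the embedding model and data layout are placeholder assumptions.

```python
import numpy as np

def retrieve_accent_prompts(query_embedding, prompt_embeddings, k=3):
    """Return indices of the k reference utterances whose accent embeddings
    are most cosine-similar to the query (e.g., a target-accent centroid)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores)[:k]

# Toy usage with random vectors standing in for accent embeddings.
rng = np.random.default_rng(0)
bank = rng.normal(size=(500, 192))       # 500 candidate speech prompts
centroid = rng.normal(size=192)          # target-accent centroid
print(retrieve_accent_prompts(centroid, bank))
```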
The experiments are robust, involving a diverse set of twelve English accents and employing both objective and subjective evaluation metrics. The use of human listening tests alongside automated metrics like NISQA and accent accuracy provides a comprehensive assessment of the system's performance. The ablation studies further validate the effectiveness of the proposed methods, showcasing improvements in accent fidelity and fairness. However, the results could benefit from a larger participant pool in subjective evaluations to enhance statistical significance.
The paper provides a clear description of the experimental setup, datasets, and evaluation metrics, which aids reproducibility. The availability of code and audio samples on GitHub is a significant step towards ensuring that other researchers can replicate the findings. However, more detailed instructions on the setup and execution of experiments would enhance reproducibility further.
One limitation is the potential for bias in the LLMs used for text adaptation and evaluation, which may affect the results. Additionally, while the framework shows promise for English accents, its applicability to other languages and dialects remains untested. The reliance on specific datasets could also limit generalizability, and the paper does not address how the model would perform with less represented accents or languages.
The implications of this research are significant, particularly in promoting inclusivity in TTS systems. By addressing accent and linguistic biases, CLARITY has the potential to enhance user experience in applications such as virtual assistants, audiobooks, and language learning tools. The framework could also inform future research on bias mitigation in AI systems, contributing to more equitable technology.
Recognizing speaker intent in long audio dialogues has a wide range of applications but is a non-trivial AI task due to complex inter-dependencies among speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using the LLM, with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering over both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio- and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition on real-world audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
Primary: South China Normal University
All Institutions: South China Normal University, Xiamen Rekey Medical Technology Co., LTD
The paper presents DialogGraph-LLM, a novel framework for audio dialogue intent recognition that combines graph-based modeling with advanced semi-supervised learning techniques, demonstrating substantial improvements over existing methods. The comprehensive methodology and robust experimental validation position this work as a meaningful contribution to the field of machine learning, particularly in audio processing and dialogue understanding.
The proposed DialogGraph-LLM framework introduces a novel Multi-Relational Dialogue Attention Network (MR-DAN) that effectively models complex inter-dependencies in audio dialogues through a graph-based approach. The integration of multimodal foundation models with a confidence-aware semi-supervised learning strategy enhances the model's ability to infer intent from raw audio data, addressing the challenges of limited annotated datasets. The methodology is well-structured, with clear definitions of edge types and attention mechanisms tailored to the unique properties of dialogue data.
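The dual-threshold and entropy components can be sketched as follows (a schematic reading of the description above, not the released code); threshold values, class counts, and shapes are illustrative assumptions.

```python
import numpy as np

def select_pseudo_labels(probs, global_thr=0.9, class_thr=None):
    """probs: (N, C) softmax outputs on unlabeled samples.
    A sample is accepted when its top confidence exceeds both the global
    threshold and the threshold of its predicted class (dual filtering)."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    if class_thr is None:
        class_thr = np.full(probs.shape[1], 0.85)       # illustrative defaults
    accepted = (conf >= global_thr) & (conf >= class_thr[preds])
    return preds, accepted

def rank_by_entropy(probs):
    """Higher predictive entropy = more informative; prioritize those samples."""
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-ent)

probs = np.random.dirichlet(np.ones(4), size=100)       # toy 4-class predictions
preds, keep = select_pseudo_labels(probs)
order = rank_by_entropy(probs[~keep])                   # which rejects to revisit first
```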
The experiments are comprehensive, utilizing both proprietary and publicly available datasets (MarketCalls and MIntRec 2.0) to evaluate the performance of DialogGraph-LLM against strong baselines. The reported results demonstrate significant improvements in accuracy and F1 scores, particularly highlighting the effectiveness of the MR-DAN and the adaptive semi-supervised learning strategy. The ablation studies further validate the contributions of individual components, reinforcing the robustness of the proposed model.
The paper provides sufficient detail regarding the experimental setup, including dataset descriptions, model configurations, and evaluation metrics, which supports reproducibility. The availability of the code repository on GitHub enhances the potential for other researchers to replicate and build upon the work.
One limitation noted is the reliance on a specific LLM backbone (Qwen2.5-Omni-7B) without exploring the performance across different LLM architectures. Additionally, the proposed adaptive semi-supervised learning strategy may still be susceptible to noise propagation from pseudo-labels, which could affect model performance in practice.
The framework has significant implications for applications in human-computer interaction, customer service, and other audio-rich domains where intent recognition is crucial. By effectively leveraging audio data and addressing the challenges of limited supervision, DialogGraph-LLM could enhance the capabilities of dialogue systems in real-world scenarios.
Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; while proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.
Primary: Stony Brook University
All Institutions: Stony Brook University, Zhejiang University, The Hong Kong Polytechnic University
This paper makes a substantial contribution by exploring a previously underexamined misuse vector for TTS systems, presenting innovative attack methodologies, and demonstrating their effectiveness across various platforms. The findings highlight urgent security concerns that need to be addressed in the development and deployment of TTS technologies.
The paper introduces HARMGEN, a novel suite of attacks that creatively combines semantic obfuscation and audio-modality exploits to bypass TTS safety mechanisms. The methodology is well-structured, addressing two significant challenges in the misuse of TTS systems. The authors provide a clear description of the attack families and their operational principles, which demonstrates a thoughtful approach to the problem. However, the complexity of the attacks may require further elaboration for full comprehension.
The experiments are comprehensive, involving five commercial LALMs-based TTS systems and three datasets across two languages. The evaluation metrics are relevant, and the results show a substantial reduction in refusal rates and an increase in toxicity, indicating that the proposed attacks are effective. However, a deeper analysis of the datasets used and the specific metrics for measuring toxicity would enhance the robustness of the findings.
The paper lacks detailed implementation information, which raises concerns about reproducibility. While the methodology is described, specifics regarding the experimental setup, including parameters and configurations used in the attacks, are not sufficiently detailed. This could hinder other researchers from replicating the results.
The study primarily focuses on the effectiveness of the attacks without a thorough exploration of the ethical implications or potential countermeasures in depth. Additionally, the performance of the attacks may vary with different TTS systems, and the paper does not address this variability extensively.
The implications of this research are significant, as it highlights a critical vulnerability in TTS systems that could be exploited for malicious purposes. The findings underscore the need for enhanced safeguards in TTS deployment and raise awareness about the potential for harmful content generation, which is increasingly relevant in today's digital landscape. The work could influence future research directions in TTS safety and security.
Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, leading to constrained capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information. Moreover, to achieve end-to-end long audio understanding, we introduce a segment-level token merging module to substantially reduce audio token redundancy and enhance the efficiency of information extraction. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing audio datasets into a new dataset focused on temporal tasks and establish a series of metrics to evaluate the fine-grained performance. Evaluations show strong performance across a variety of fine-grained tasks, such as dense captioning, temporal grounding, and timeline speech summarization, demonstrating TimeAudio's robust temporal localization and reasoning capabilities.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device
The paper presents TimeAudio, a novel approach to enhancing temporal localization in large audio-language models, addressing significant gaps in existing methodologies. The comprehensive evaluation of the proposed methods and the introduction of a dedicated dataset for temporal tasks mark a meaningful contribution to the field of audio understanding in machine learning.
The methodology is well-structured, introducing innovative components such as temporal markers and absolute time-aware encoding to enhance the temporal understanding of audio-language models. The segment-level token merging module is a significant contribution, addressing the inefficiencies in processing long audio inputs. The proposed FTAR dataset is a valuable addition, specifically designed for temporal reasoning tasks, which is a notable gap in existing datasets.
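As a rough illustration of segment-level token merging (not TimeAudio's exact module), the sketch below mean-pools fixed-length segments of an audio token sequence to shrink it before the LLM; the segment length and dimensions are assumptions.

```python
import torch

def merge_tokens_by_segment(tokens, segment_len=4):
    """tokens: (batch, T, dim). Mean-pool every `segment_len` consecutive
    tokens, reducing the sequence length by that factor (any remainder is
    dropped here for simplicity)."""
    b, t, d = tokens.shape
    t_trim = (t // segment_len) * segment_len
    segments = tokens[:, :t_trim].reshape(b, t_trim // segment_len, segment_len, d)
    return segments.mean(dim=2)

x = torch.randn(2, 1500, 768)            # stand-in for a long-audio token sequence
print(merge_tokens_by_segment(x).shape)  # torch.Size([2, 375, 768])
```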
The experiments are comprehensive, comparing TimeAudio against several baseline models across multiple tasks, including dense audio captioning, temporal audio grounding, and timeline speech summarization. The results demonstrate significant improvements in performance metrics, validating the effectiveness of the proposed methods. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component.
The paper provides sufficient implementation details, including the architecture, training strategy, and dataset construction. The availability of the code on GitHub enhances reproducibility, although more explicit instructions on the setup and dependencies would be beneficial.
The paper does not address potential limitations in terms of the generalizability of the model to unseen audio types or the scalability of the proposed methods to even longer audio segments. Additionally, the reliance on specific datasets may limit the model's applicability in broader contexts.
The advancements in temporal audio understanding have implications for various applications, including audio search engines, interactive AI assistants, and accessibility tools for the hearing impaired. The work could lead to more sophisticated audio processing systems that better understand and respond to human queries about audio content.
Brain-computer interface (BCI) speech decoding has emerged as a promising tool for assisting individuals with speech impairments. In this context, the integration of electroencephalography (EEG) and electromyography (EMG) signals offers strong potential for enhancing decoding performance. Mandarin tone classification presents particular challenges, as tonal variations convey distinct meanings even when phonemes remain identical. In this study, we propose a novel cross-subject multimodal BCI decoding framework that fuses EEG and EMG signals to classify four Mandarin tones under both audible and silent speech conditions. Inspired by the cooperative mechanisms of neural and muscular systems in speech production, our neural decoding architecture combines spatial-temporal feature extraction branches with a cross-attention fusion mechanism, enabling informative interaction between modalities. We further incorporate domain-adversarial training to improve cross-subject generalization. We collected 4,800 EEG trials and 4,800 EMG trials from 10 participants using only twenty EEG and five EMG channels, demonstrating the feasibility of minimal-channel decoding. Despite employing lightweight modules, our model outperforms state-of-the-art baselines across all conditions, achieving average classification accuracies of 87.83% for audible speech and 88.08% for silent speech. In cross-subject evaluations, it still maintains strong performance with accuracies of 83.27% and 85.10% for audible and silent speech, respectively. We further conduct ablation studies to validate the effectiveness of each component. Our findings suggest that tone-level decoding with minimal EEG-EMG channels is feasible and potentially generalizable across subjects, contributing to the development of practical BCI applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the development of CAT-Net, a novel cross-attention framework for EEG-EMG fusion that significantly enhances Mandarin tone classification accuracy in both audible and silent speech conditions. This work represents a substantial advancement in the field of brain-computer interfaces, particularly in addressing the challenges of cross-subject generalization and minimal-channel decoding.
The proposed CAT-Net framework introduces a cross-attention mechanism that enhances the interaction between EEG and EMG modalities for tone classification. This approach is innovative as it moves beyond traditional concatenation methods, allowing for a more nuanced understanding of the interplay between neural and muscular signals during speech production. The integration of domain-adversarial training to improve cross-subject generalization is a significant methodological advancement, addressing a critical challenge in BCI applications. The use of minimal-channel configurations is also noteworthy, demonstrating efficiency in data collection and processing.
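A minimal sketch of bidirectional cross-attention fusion between two modality streams is shown below, built on standard multi-head attention; layer sizes and pooling are placeholder assumptions rather than CAT-Net's configuration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """EEG queries attend to EMG features and vice versa; the two attended
    streams are pooled and concatenated as the fused representation."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.eeg_to_emg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.emg_to_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, eeg, emg):
        # eeg, emg: (batch, time, dim) feature sequences from each branch
        eeg_ctx, _ = self.eeg_to_emg(query=eeg, key=emg, value=emg)
        emg_ctx, _ = self.emg_to_eeg(query=emg, key=eeg, value=eeg)
        return torch.cat([eeg_ctx.mean(dim=1), emg_ctx.mean(dim=1)], dim=-1)

fusion = CrossModalFusion()
out = fusion(torch.randn(8, 200, 128), torch.randn(8, 200, 128))  # (8, 256)
```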
The experiments conducted are robust, with a well-defined dataset comprising 4,800 EEG and 4,800 EMG trials from 10 participants. The results show that CAT-Net outperforms state-of-the-art baselines across various conditions, achieving high classification accuracies for both audible and silent speech. The use of ablation studies to assess the contribution of different components of the model adds rigor to the evaluation, providing insights into the effectiveness of the cross-attention mechanism and domain adaptation strategies.
While the paper provides a detailed description of the methodology and experimental setup, the reproducibility may be limited due to the lack of specific information on the data collection process and participant demographics. The availability of the code on GitHub is a positive aspect that can aid in reproducibility, but comprehensive documentation of the experimental conditions and data preprocessing steps would enhance this further.
One limitation is the small sample size of 10 participants, which may affect the generalizability of the findings. Additionally, while the model shows strong performance in cross-subject evaluations, the reliance on a specific tonal language (Mandarin) may limit its applicability to other languages or dialects. The potential for overfitting in the context of limited training data is also a concern.
The implications of this research are significant for the development of practical BCI applications, particularly for individuals with speech impairments. By demonstrating the feasibility of effective tone-level decoding using minimal channels, this work paves the way for more accessible and user-friendly BCI systems. The approach could be extended to other languages and communication modalities, potentially benefiting a broader range of users.
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
Primary: NVIDIA
All Institutions: NVIDIA
Music Flamingo introduces a novel large audio-language model that significantly enhances music understanding through innovative methodologies and a comprehensive dataset. The paper's contributions are substantial, addressing critical gaps in the field and setting a foundation for future advancements in music perception and reasoning.
The methodology is robust, introducing a multi-stage pipeline for dataset curation (MF-Skills) that enhances the richness of music understanding. The use of a cold-start approach with MF-Think and reinforcement learning with custom rewards is innovative and tailored to the complexities of music theory, indicating a thoughtful design to improve reasoning capabilities.
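The group-relative advantage computation at the heart of GRPO-style post-training can be sketched in a few lines (the reward values are dummies; the paper's custom music rewards are not reproduced here): rewards for a group of responses sampled from the same prompt are normalized by the group mean and standard deviation, removing the need for a learned value function.

```python
import numpy as np

def group_relative_advantages(rewards):
    """rewards: 1-D array of scalar rewards for responses sampled from the
    same prompt. GRPO-style advantage = (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g., custom music-theory rewards for 6 sampled answers to one question
print(group_relative_advantages([0.2, 0.9, 0.4, 0.4, 1.0, 0.1]))
```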
The paper reports state-of-the-art results across 10+ benchmarks, demonstrating the effectiveness of Music Flamingo in music understanding and reasoning tasks. However, details on the specific benchmarks and the comparative performance metrics could be elaborated for clarity.
The paper does not provide explicit implementation details or code availability, which raises concerns about reproducibility. While the project URL is provided, the absence of a public repository for the model and datasets limits the ability for other researchers to replicate the findings.
The paper acknowledges the challenges of scaling audio understanding models due to data scarcity, which remains a significant limitation. Additionally, the reliance on a specific dataset may limit generalizability across diverse musical genres and cultures.
Music Flamingo has the potential to significantly advance the field of audio-language models, enabling deeper engagement with music in various applications, including music education, recommendation systems, and interactive music generation. This work sets a new benchmark for future research in music understanding.
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposes new safety risks arising from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embed harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses to ensure the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
Primary: Tsinghua University
All Institutions: Tsinghua University, Shanghai Artificial Intelligence Laboratory, University of Cambridge
The paper presents a novel approach to evaluating and mitigating vulnerabilities in multimodal LLMs through innovative audio-based attacks and defenses. Its contributions are significant for advancing the safety and robustness of AI systems in handling complex audio inputs.
The methodology is innovative, introducing SACRED-Bench, a benchmark that exploits the complexities of audio inputs for red-teaming multimodal LLMs. The three attack mechanisms—speech overlap, speech-audio mixture, and diverse spoken instruction formats—are well-conceived and demonstrate a clear understanding of the vulnerabilities in existing models. The construction of the attacks is methodical, leveraging both semantic and acoustic layers to obscure harmful content, which is a significant advancement over previous perturbation-based methods.
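To make the composition mechanisms concrete, the sketch below overlays a harmful prompt beneath a benign carrier utterance at reduced gain, the simplest form of the speech-overlap attack described above. It is a minimal illustration under assumed settings (the -12 dB offset and the stand-in sine signals are not from the paper), not the benchmark's actual construction pipeline.

```python
import numpy as np

def mix_overlap(benign: np.ndarray, harmful: np.ndarray, gain_db: float = -12.0) -> np.ndarray:
    """Overlay a harmful prompt beneath a benign utterance at reduced gain."""
    # Pad the shorter signal so both have equal length.
    n = max(len(benign), len(harmful))
    benign = np.pad(benign, (0, n - len(benign)))
    harmful = np.pad(harmful, (0, n - len(harmful)))
    gain = 10.0 ** (gain_db / 20.0)           # convert dB offset to linear gain
    mixture = benign + gain * harmful
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture  # avoid clipping

# Stand-in signals (sine tones) in place of real recordings.
sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
benign_speech = 0.5 * np.sin(2 * np.pi * 220 * t)
harmful_speech = 0.5 * np.sin(2 * np.pi * 330 * t)
attack_audio = mix_overlap(benign_speech, harmful_speech)
```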
The experiments are robust, with a comprehensive evaluation of multiple state-of-the-art models, including proprietary and open-source LLMs. The reported attack success rates provide compelling evidence of the vulnerabilities in these models, and the introduction of SALMONN-Guard as a mitigation strategy is well-supported by experimental results showing a significant reduction in attack success rates. The use of a diverse dataset for training and testing enhances the reliability of the findings.
The paper provides sufficient detail regarding the experimental setup, including the datasets used and the evaluation metrics. However, the lack of a publicly available implementation or code repository may hinder full reproducibility. The authors mention using Hugging Face for dataset availability, but further details on the model training and evaluation processes would strengthen reproducibility.
One limitation is the potential ethical concerns surrounding the creation and use of harmful audio content for testing purposes. Additionally, while the proposed SALMONN-Guard shows promise, its effectiveness in real-world applications remains to be fully validated. The reliance on specific models for evaluation may also limit the generalizability of the findings to other LLMs not included in the study.
The research has significant implications for the safety and robustness of multimodal LLMs, particularly in applications where audio inputs are prevalent. The introduction of SACRED-Bench and SALMONN-Guard could lead to improved safety measures in AI systems, enhancing their reliability in real-world scenarios. This work emphasizes the importance of developing comprehensive defenses against emerging threats in AI, particularly as models become more capable of processing complex inputs.
The paper presents a comprehensive exploration of speech-audio compositional attacks on multimodal LLMs and introduces a novel defense mechanism. Its contributions are significant for advancing the understanding of vulnerabilities in audio processing within AI systems and establishing benchmarks for future research in this area.
The paper introduces SACRED-Bench, a novel benchmark for evaluating the robustness of multimodal LLMs against speech-audio compositional attacks. The methodology is innovative, leveraging complex audio compositions that include speech overlap, multi-speaker dialogue, and non-speech audio mixtures to craft adversarial examples. This approach is distinct from traditional perturbation-based methods, which often rely on noise optimization. The introduction of SALMONN-Guard as a defense mechanism adds a significant layer of complexity, showcasing a proactive approach to safeguarding against these attacks. The detailed construction of attack types and the rationale behind their design demonstrate a thorough understanding of the vulnerabilities in current LLMs.
The experimental setup is robust, involving a comprehensive evaluation of multiple state-of-the-art LLMs against the proposed SACRED-Bench attacks. The results indicate a high attack success rate, even for advanced models like Gemini 2.5 Pro, highlighting significant vulnerabilities. The effectiveness of SALMONN-Guard in reducing attack success rates to 20% is particularly noteworthy, demonstrating the practical applicability of the proposed defense mechanism. The experiments are well-structured, with clear metrics for evaluating attack success rates across different methods.
The paper provides a clear description of the experimental setup, including the datasets used and the specific methodologies for generating adversarial examples. However, the reproducibility could be enhanced by providing more detailed implementation specifics, such as hyperparameters for the SALMONN-Guard model and the exact configurations used during training. The availability of the benchmark dataset on Hugging Face is a positive step towards facilitating reproducibility.
One limitation is the reliance on synthetic data for training SALMONN-Guard, which may not capture all real-world scenarios. Additionally, while the paper addresses the vulnerabilities of current LLMs, it does not extensively discuss the potential for adversarial attacks to evolve, which could undermine the effectiveness of the proposed defenses over time. The ethical implications of generating harmful audio content for testing purposes also warrant consideration.
The findings of this research have significant implications for the safety and reliability of multimodal LLMs, particularly in applications involving audio processing. As LLMs become more integrated into various domains, understanding and mitigating risks associated with audio inputs is crucial. The introduction of SACRED-Bench and SALMONN-Guard could pave the way for more robust safety mechanisms in AI systems, ultimately contributing to safer AI deployment in sensitive applications.
In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms, including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds, where hyperbolic geometry excels at modeling hierarchical generator families and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis-invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting a new state of the art in cross-paradigm ADD.
Primary: unknown
All Institutions: unknown
The paper presents RHYME, a geometry-aware framework for generalizable audio deepfake detection that effectively combines hyperbolic and spherical geometry to improve detection across diverse synthesis paradigms. This innovative approach addresses a critical gap in the field, offering a promising solution to enhance the robustness of deepfake detection systems.
The proposed RHYME framework introduces a novel approach to audio deepfake detection by leveraging non-Euclidean geometry, specifically hyperbolic and spherical spaces, to model and fuse embeddings from various speech synthesis paradigms. This geometry-aware modeling is innovative as it addresses the challenge of generalization across different synthesis methods, which has been a significant limitation in prior work. The use of Riemannian barycentric averaging for fusion is a unique aspect that enhances the robustness of the detection framework.
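The geometry-aware fusion can be illustrated with a small sketch: embeddings are mapped onto a Poincare ball and the unit sphere, and a tangent-space weighted average approximates the Riemannian barycenter. This is a generic illustration of the projection-and-fusion idea, assuming unit curvature and toy embeddings; RHYME's exact barycentric averaging and encoder fusion may differ.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True).clip(min=1e-9)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def logmap0(x, c=1.0):
    """Logarithmic map back to the tangent space at the origin."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True).clip(min=1e-9, max=1 - 1e-6)
    return np.arctanh(np.sqrt(c) * norm) * x / (np.sqrt(c) * norm)

def sphere_project(x):
    """Project onto the unit hypersphere (angular, energy-invariant view)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True).clip(min=1e-9)

def fuse(embeddings, weights, c=1.0):
    """Tangent-space approximation of a Riemannian barycenter of hyperbolic points."""
    tangent = np.stack([logmap0(expmap0(e, c), c) for e in embeddings])
    w = np.asarray(weights)[:, None]
    return expmap0((w * tangent).sum(axis=0) / w.sum(), c)

# Example: fuse utterance embeddings from two pretrained encoders.
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=256) * 0.1, rng.normal(size=256) * 0.1
fused = fuse([e1, e2], weights=[0.6, 0.4])
angular_view = sphere_project(e1)
```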
The experiments conducted on two benchmark datasets (ASVspoof and DFADD) provide a rigorous evaluation of the RHYME framework's performance. The paper demonstrates strong results across various settings, including zero-shot and cross-corpus evaluations, which are critical for assessing generalization capabilities. However, the paper would benefit from more detailed statistical analysis and comparisons with a broader range of existing methods.
The paper includes a GitHub repository link that provides access to the source code and models, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation instructions and hyperparameter settings that would facilitate easier reproduction of results by other researchers.
The study is limited to English-language datasets, which restricts the generalizability of the findings to multilingual and accented speech. Additionally, the paper does not address potential biases in the datasets used, which could affect the robustness of the detection framework in real-world applications.
The RHYME framework has significant implications for enhancing the security of audio communications against deepfake technologies, which are increasingly used for malicious purposes. By providing a reliable detection mechanism, the work contributes to the broader field of digital forensics and speech security. However, the authors acknowledge the potential for misuse of detection technologies, emphasizing the need for responsible deployment.
Automatic speech recognition (ASR) systems often rely on autoregressive (AR) Transformer decoder architectures, which limit efficient inference parallelization due to their sequential nature. To this end, non-autoregressive (NAR) approaches aim primarily to achieve significant decoding speedup while maintaining recognition accuracy comparable to AR baselines. This paper proposes a novel NAR block-based attention mask decoder (AMD) that effectively improves decoding efficiency while maintaining ASR accuracy, and also offers flexibility in balancing the performance-efficiency trade-off on both Conformer and large language model (LLM)-based ASR systems. The proposed AMD performs parallel inference within contiguous blocks of output labels while maintaining monotonic left-to-right prediction between blocks. A one-pass beam search algorithm is designed to dynamically fuse Connectionist Temporal Classification (CTC), AR decoder, and AMD probabilities. Experiments are conducted on the LS960 normal speech and DBank elderly speech corpora across: a) a Conformer encoder-decoder ASR system with filterbank input features; b) its integration with WavLM features; and c) a further advancement integrating an LLM-based decoder. On the LS960 task, the proposed AMD-empowered tripartite decoder achieves decoding speedup ratios of up to 1.44x, 1.55x, and 2.31x under the three model configurations over the CTC + AR baselines, without statistically significant WER increases. When operating with real-time factors (RTFs) comparable to the baselines, the tripartite decoder produces statistically significant WER reductions of 0.19%, 0.62%, and 0.13% absolute (4.3%, 16.3%, and 3.8% relative). Similar improvements are also obtained on the DBank task.
Primary: Chinese University of Hong Kong
All Institutions: Chinese University of Hong Kong, Chinese Academy of Sciences, National Research Council Canada, Nanyang Technological University, Tsinghua University
This paper presents a novel approach to enhancing ASR systems through an innovative non-autoregressive decoder, significantly improving decoding efficiency while maintaining accuracy, particularly in challenging speech domains.
The proposed methodology introduces a novel non-autoregressive block-based attention mask decoder (AMD) that enhances decoding efficiency while maintaining accuracy in ASR systems. The integration of a one-pass beam search algorithm that dynamically fuses CTC, AR decoder, and AMD probabilities is a notable advancement. The approach effectively balances performance and efficiency trade-offs, particularly in atypical speech domains, which is a significant contribution to the field.
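A minimal sketch of the kind of attention mask such a block-based NAR decoder could use is shown below: tokens attend to everything in earlier blocks plus their own block, enabling parallel prediction within a block while keeping blocks strictly left-to-right. The block size and boolean-mask convention are assumptions, not the paper's implementation.

```python
import numpy as np

def block_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """True = attention allowed. A token sees all previous blocks plus its own block,
    so labels inside a block can be predicted in parallel while blocks stay left-to-right."""
    block_id = np.arange(seq_len) // block_size
    # Query in block i may attend to any key whose block index is <= i.
    return block_id[:, None] >= block_id[None, :]

mask = block_attention_mask(seq_len=8, block_size=4)
print(mask.astype(int))
# First 4 rows see only block 0; last 4 rows see blocks 0 and 1.
```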
The experiments conducted on the LS960 and DBank datasets demonstrate the effectiveness of the AMD-empowered tripartite decoder across various configurations. The results show significant decoding speedups without substantial increases in word error rates (WER), affirming the robustness of the proposed method. The statistical significance of the results adds credibility to the findings.
The paper provides detailed descriptions of the experimental setup, including model configurations, training procedures, and hyperparameter settings. However, the absence of publicly available code or datasets limits the reproducibility of the results.
One limitation is the reliance on specific datasets, which may not generalize to all ASR applications. Additionally, while the AMD shows improvements, the performance gap between NAR and AR models in certain contexts remains an area for further exploration.
The advancements in non-autoregressive decoding techniques have the potential to significantly enhance real-time applications of ASR systems, particularly in scenarios requiring high efficiency and low latency, such as in medical and elderly speech recognition. This work could lead to broader applications in assistive technologies and improve accessibility for diverse populations.
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. We also introduce novel metrics tailored to the task. Experimental results demonstrate the superiority of VeM, particularly in semantic relevance and rhythmic precision.
Primary: UGen
All Institutions: UGen
The paper presents a novel approach to video-to-music generation that leverages advanced machine learning techniques to enhance semantic, temporal, and rhythmic alignment. The integration of hierarchical parsing and innovative alignment mechanisms represents a meaningful advancement in the field, addressing critical limitations of existing methods and setting a foundation for future research in multimodal content generation.
The proposed methodology, Video Echoed in Music (VeM), introduces a diffusion-based framework that effectively integrates hierarchical video parsing and a storyboard-guided cross-attention mechanism to achieve semantic, temporal, and rhythmic alignment in video-to-music generation. The use of modality-specific encoders and the transition-beat aligner demonstrates a thoughtful approach to addressing the shortcomings of existing methods, particularly in maintaining rhythmic precision and enhancing audiovisual coherence. The hierarchical parsing serves as a robust conductor for the music generation process, allowing for a nuanced understanding of video content that is critical for effective music composition.
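As a rough illustration of storyboard-guided cross-attention, the sketch below lets music latents attend to shot-level video features augmented with additive position and duration encodings. This is a plausible minimal form under assumed dimensions and module names, not the paper's SG-CAtt implementation.

```python
import torch
import torch.nn as nn

class StoryboardCrossAttention(nn.Module):
    """Music latents (queries) attend to storyboard-level video features (keys/values)
    that carry additive position and duration encodings."""
    def __init__(self, dim: int = 256, heads: int = 4, max_shots: int = 64):
        super().__init__()
        self.pos_emb = nn.Embedding(max_shots, dim)      # shot index
        self.dur_proj = nn.Linear(1, dim)                # shot duration in seconds
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, music_latents, shot_feats, shot_durations):
        idx = torch.arange(shot_feats.size(1), device=shot_feats.device)
        cond = shot_feats + self.pos_emb(idx) + self.dur_proj(shot_durations.unsqueeze(-1))
        out, _ = self.attn(music_latents, cond, cond)
        return music_latents + out                       # residual connection

# Toy shapes: batch of 2, 16 music latent steps, 5 storyboard shots, dim 256.
layer = StoryboardCrossAttention()
z = torch.randn(2, 16, 256)
shots = torch.randn(2, 5, 256)
durs = torch.rand(2, 5) * 4.0
out = layer(z, shots, durs)
```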
The experimental evaluation is thorough, utilizing both objective and subjective metrics to assess the performance of VeM against established baselines. The introduction of the TB-Match dataset, specifically designed for this task, adds significant value, as it imposes stricter synchronization requirements that enhance the robustness of the evaluation. The results indicate that VeM outperforms existing methods across various metrics, showcasing improvements in music quality, semantic alignment, temporal synchronization, and rhythmic consistency. The ablation studies further validate the contributions of each component within the framework.
The paper provides detailed implementation information, including training configurations and the architecture of the models used, which supports reproducibility. However, the reliance on specific pre-trained models and the limited availability of high-quality datasets may pose challenges for full replication of results.
The paper acknowledges limitations related to the inherent challenges of achieving perfect alignment between visual transitions and music beats, particularly in scenarios with frequent transitions. Additionally, the focus on specific domains (e-commerce and video-sharing platforms) may limit the generalizability of the findings to other types of video content.
This research has significant implications for various fields, including film, advertising, and gaming, where effective video-to-music generation can enhance user engagement and emotional impact. The methodology could also pave the way for more advanced applications in automated content creation and multimedia editing, potentially transforming how audiovisual content is produced.
Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons. Demo and code are available at https://Tayjsl97.github.io/Diff-V2M-Demo/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Diff-V2M, a hierarchical conditional diffusion model that effectively addresses the challenges of video-to-music generation by incorporating explicit rhythm modeling and a sophisticated feature integration strategy. This work significantly advances the field of audiovisual content generation, providing a robust framework that outperforms existing methods and opens avenues for future research.
The proposed methodology in Diff-V2M is innovative, leveraging a hierarchical conditional diffusion model to address the challenges of video-to-music generation. The introduction of a rhythmic predictor that infers rhythmic representations directly from video content is a significant advancement, as it allows for explicit rhythm modeling, which has been a gap in previous works. The use of multiple visual features (emotional, semantic, and rhythmic) integrated through a hierarchical cross-attention mechanism is a robust approach that enhances the model's ability to generate coherent music aligned with video content. The timestep-aware fusion strategies (FiLM and weighted fusion) further demonstrate a thoughtful design to adaptively balance the contributions of different features, enhancing the model's performance.
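The timestep-aware fusion can be sketched as follows: a diffusion-timestep embedding produces FiLM scale/shift parameters and a scalar weight that blends semantic and rhythmic features. All dimensions and the exact placement of the modulation are assumptions; Diff-V2M's actual hierarchical cross-attention layers are more elaborate.

```python
import torch
import torch.nn as nn

class TimestepFiLMFusion(nn.Module):
    """FiLM-style modulation: a diffusion-timestep embedding produces per-channel
    scale/shift, and a learned timestep-dependent weight blends semantic vs. rhythmic cues."""
    def __init__(self, dim: int = 256, t_dim: int = 128):
        super().__init__()
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)
        self.to_mix = nn.Sequential(nn.Linear(t_dim, 1), nn.Sigmoid())

    def forward(self, h, semantic, rhythmic, t_emb):
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        alpha = self.to_mix(t_emb)                      # in (0, 1)
        cond = alpha * semantic + (1.0 - alpha) * rhythmic
        return (1.0 + scale) * (h + cond) + shift       # feature-wise linear modulation

fusion = TimestepFiLMFusion()
h = torch.randn(2, 100, 256)          # generator hidden states
sem = torch.randn(2, 100, 256)
rhy = torch.randn(2, 100, 256)
t_emb = torch.randn(2, 1, 128)        # broadcast over the time axis
y = fusion(h, sem, rhy, t_emb)
```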
The experimental evaluation is comprehensive, utilizing both in-domain and out-of-domain datasets to assess the model's performance. The paper reports extensive quantitative results that show Diff-V2M outperforms existing methods in various metrics, including Frechet Audio Distance and subjective evaluations. The use of ablation studies to analyze the impact of different components of the model strengthens the findings, providing insights into the importance of rhythmic and emotional features in music generation.
The paper includes detailed implementation details, including the architecture of the model, training strategies, and evaluation metrics, which contribute to reproducibility. The availability of a demo and code repository enhances the potential for other researchers to replicate the results. However, the actual code repository is not provided, which could limit reproducibility for some aspects.
The paper acknowledges limitations, such as the reliance on scene cuts and inter-frame differences, which may overlook subtle motion cues in human-centric videos. Additionally, the model does not offer explicit control over musical attributes like genre and emotion, which could restrict its adaptability in certain contexts. These limitations highlight areas for future research and improvement.
The implications of this research are significant, particularly in the context of personalized audiovisual content creation, which is increasingly relevant in the age of social media and video platforms. The ability to generate music that aligns with visual content can enhance user experiences in various applications, including video editing, gaming, and interactive media. The advancements in rhythm modeling and feature integration could also inspire further research in multimodal generative models.
Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, speech-related retrievers show promise for preprocessing long-form speech, but the performance of existing speech-related retrievers remains limited. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
Primary: Unknown
All Institutions: Unknown
The paper presents CLSR, a novel end-to-end contrastive language-speech retriever that significantly enhances spoken question answering by effectively bridging acoustic and textual modalities. The comprehensive methodology and strong experimental results position this work as a valuable contribution to the field of audio processing and machine learning.
The methodology presented in this paper introduces the CLSR model, which innovatively bridges the gap between acoustic and textual representations through an intermediate text-like representation. This approach addresses the limitations of existing models that struggle with long-form audio by leveraging a continuous integrate-and-fire mechanism and a vector quantizer to enhance the alignment between modalities. The model's architecture is well-structured, and the use of contrastive learning principles is effectively integrated into the retrieval process. However, the paper could benefit from a more detailed explanation of the CIF and sampler mechanisms, as well as their specific contributions to the overall performance improvements.
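The core alignment objective behind such a retriever can be sketched with a symmetric contrastive (InfoNCE) loss over paired speech-segment and question embeddings, as below. This shows only the generic contrastive step, not CLSR's CIF-based conversion to text-like representations; the temperature and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def speech_text_contrastive_loss(speech_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (speech segment, question text) embeddings;
    matching pairs lie on the diagonal of the similarity matrix."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy batch of 8 paired embeddings with dimension 512.
loss = speech_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```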
The experimental evaluation is robust, utilizing four diverse datasets to demonstrate the effectiveness of CLSR against both end-to-end and pipeline models. The results indicate significant improvements in retrieval performance, particularly in long-form spoken question answering tasks. The comparison with baseline models is thorough, and the ablation studies provide insights into the contributions of various components of the model. However, the paper could improve by including more extensive analysis of the results, such as error analysis or qualitative assessments of the retrieval outputs.
The paper provides sufficient details regarding the experimental setup, including the datasets used, model configurations, and training procedures. The inclusion of a GitHub repository with the code enhances reproducibility. However, the lack of specific hyperparameter tuning details and the absence of a clear description of the computational resources used may hinder full reproducibility.
One limitation of the study is the reliance on specific datasets that may not fully represent the diversity of spoken language contexts encountered in real-world applications. Additionally, while the CLSR model shows promising results, it may still face challenges in handling highly noisy audio inputs or very diverse accents, which are common in practical scenarios. The paper also does not address the scalability of the model when applied to larger datasets or more complex retrieval tasks.
The proposed CLSR model has significant implications for practical applications in spoken question answering systems, particularly in contexts such as education, customer service, and information retrieval from long audio sources. By improving the efficiency and accuracy of retrieving relevant audio segments, CLSR could enhance user experiences in various interactive voice applications. The model's architecture could also inspire future research in cross-modal learning and retrieval systems.
Large Audio-language Models (LAMs) have recently enabled powerful speech-based interactions by coupling audio encoders with Large Language Models (LLMs). However, the security of LAMs under adversarial attacks remains underexplored, especially through audio jailbreaks that craft malicious audio prompts to bypass alignment. Existing efforts primarily rely on converting text-based attacks into speech or applying shallow signal-level perturbations, overlooking the impact of human speech's expressive variations on LAM alignment robustness. To address this gap, we propose StyleBreak, a novel style-aware audio jailbreak framework that systematically investigates how diverse human speech attributes affect LAM alignment robustness. Specifically, StyleBreak employs a two-stage style-aware transformation pipeline that perturbs both textual content and audio to control linguistic, paralinguistic, and extralinguistic attributes. Furthermore, we develop a query-adaptive policy network that automatically searches for adversarial styles to enhance the efficiency of LAM jailbreak exploration. Extensive evaluations demonstrate that LAMs exhibit critical vulnerabilities when exposed to diverse human speech attributes. Moreover, StyleBreak achieves substantial improvements in attack effectiveness and efficiency across multiple attack paradigms, highlighting the urgent need for more robust alignment in LAMs.
Primary: unknown
All Institutions: unknown
This paper introduces StyleBreak, a novel framework that reveals critical vulnerabilities in large audio-language models through style-aware audio jailbreaks, significantly enhancing attack effectiveness and efficiency. The comprehensive methodology and experimental validation underscore its potential impact on the field of machine learning and model safety.
The proposed methodology of StyleBreak is innovative, employing a two-stage style-aware transformation pipeline that manipulates both textual and audio inputs to create adversarial examples. The integration of a query-adaptive policy network to enhance the efficiency of adversarial style exploration is a notable contribution, addressing a significant gap in existing research on LAM vulnerabilities. However, the complexity of the methodology may limit its accessibility for practical applications.
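A minimal sketch of a query-adaptive style policy is given below: a query embedding parameterizes a categorical distribution over discrete style attributes, updated with a REINFORCE-style gradient on an attack-success reward. The style space, reward signal, and network sizes are placeholders; the paper's policy network is not reproduced here.

```python
import torch
import torch.nn as nn

class StylePolicy(nn.Module):
    """Query-conditioned categorical policy over discrete style attributes
    (e.g. emotion, speaking rate, accent), trained with REINFORCE on attack reward."""
    def __init__(self, query_dim: int = 768, n_styles: int = 12):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(query_dim, 256), nn.ReLU(), nn.Linear(256, n_styles))

    def forward(self, query_emb):
        return torch.distributions.Categorical(logits=self.head(query_emb))

policy = StylePolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

query = torch.randn(4, 768)           # embeddings of 4 harmful queries (stand-ins)
dist = policy(query)
styles = dist.sample()                # chosen style index per query
reward = torch.rand(4)                # stand-in: 1 if the styled audio jailbreaks the LAM
loss = -(dist.log_prob(styles) * (reward - reward.mean())).mean()  # REINFORCE with baseline
opt.zero_grad(); loss.backward(); opt.step()
```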
The experiments are comprehensive, demonstrating the effectiveness of StyleBreak across multiple LAMs and attack paradigms. The evaluation metrics are well-defined, and the results convincingly illustrate the vulnerabilities of LAMs to adversarial audio prompts. However, the reliance on a limited dataset for training and evaluation may affect the generalizability of the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the methodology is described, specific parameters, configurations, and the exact setup for experiments are not fully disclosed, which could hinder reproducibility.
The study primarily focuses on a specific type of adversarial attack (audio jailbreak), which may not encompass all potential vulnerabilities of LAMs. Additionally, the effectiveness of StyleBreak may vary with different LAM architectures and settings, which is not fully explored in the paper.
The findings have significant implications for the security of LAMs in real-world applications, highlighting the need for robust alignment mechanisms to prevent malicious exploitation. The work could influence future research directions in model safety and adversarial robustness, particularly in audio processing contexts.
Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring this multiplicity of solutions. Here we introduce MEGAMI (Multitrack Embedding Generative Auto MIxing), a generative framework that models the conditional distribution of professional mixes given unprocessed tracks. MEGAMI uses a track-agnostic effects processor conditioned on per-track generated embeddings, handles arbitrary unlabeled tracks through a permutation-equivariant architecture, and enables training on both dry and wet recordings via domain adaptation. Our objective evaluation using distributional metrics shows consistent improvements over existing methods, while listening tests indicate performance approaching human-level quality across diverse musical genres.
Primary: Aalto University
All Institutions: Aalto University, Sony AI
The main contribution of this paper is the introduction of MEGAMI, a generative framework for automatic music mixing that effectively captures the complexity of mixing decisions through innovative methodologies, demonstrating significant improvements over existing systems in both objective and subjective evaluations. This work represents a meaningful advancement in the intersection of machine learning and audio engineering, with the potential to reshape the landscape of music production.
The paper introduces MEGAMI, a novel generative framework for automatic music mixing that leverages conditional diffusion models to capture the multimodal nature of mixing decisions. The methodology is well-structured, utilizing a track-agnostic effects processor and a permutation-equivariant architecture, which allows for handling arbitrary numbers of unlabeled tracks. The separation of mixing effects from musical content through learned embeddings is a significant advancement over traditional deterministic approaches. The domain adaptation strategy is innovative, enabling training on wet-only datasets, which is a critical limitation in existing systems.
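The permutation-equivariant handling of unlabeled tracks can be illustrated with a DeepSets-style layer, where each track is updated from its own features plus a pooled mix context; reordering the input tracks simply reorders the outputs. This is a generic sketch with assumed dimensions, not MEGAMI's architecture.

```python
import torch
import torch.nn as nn

class PermEquivariantTrackEncoder(nn.Module):
    """DeepSets-style layer: each track is updated from its own features plus a
    pooled mix context, so reordering tracks only reorders the outputs."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.local = nn.Linear(dim, dim)
        self.context = nn.Linear(dim, dim)

    def forward(self, tracks):                 # (batch, n_tracks, dim), n_tracks arbitrary
        pooled = tracks.mean(dim=1, keepdim=True)
        return torch.relu(self.local(tracks) + self.context(pooled))

enc = PermEquivariantTrackEncoder()
x = torch.randn(1, 6, 128)                     # 6 unlabeled stems
perm = torch.randperm(6)
assert torch.allclose(enc(x)[:, perm], enc(x[:, perm]), atol=1e-5)  # equivariance check
```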
The experiments are robust, employing both objective metrics (Kernel Audio Distance) and subjective listening tests to evaluate the quality of mixes produced by MEGAMI compared to human mixes and other baselines. The results indicate consistent improvements over existing methods, with subjective evaluations showing that MEGAMI approaches human-level quality in music mixing. The use of diverse datasets enhances the credibility of the findings, although the paper could benefit from a larger participant pool in subjective tests for more statistically significant results.
The paper provides a comprehensive description of the implementation details, including the architecture of the models and the datasets used for training and evaluation. The availability of a GitHub repository for the project further supports reproducibility. However, the paper could improve by including more detailed instructions on how to replicate the experiments, particularly regarding the data preprocessing steps and hyperparameter settings.
One limitation is the reliance on the quality and diversity of the training datasets, particularly the wet-only datasets, which may not fully capture the range of mixing styles present in professional mixes. Additionally, while the subjective evaluation shows promising results, the variability in listener preferences could affect the perceived quality of the mixes. The paper also does not address the potential computational cost of training and inference, which could limit practical applications.
The proposed framework has significant implications for the music production industry, potentially democratizing access to high-quality mixing tools for independent artists and producers. The ability to generate professional-quality mixes automatically could streamline the production process and reduce costs. Furthermore, the methodologies developed could inspire future research in other areas of audio processing and generative modeling.
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of HQ-SVC, a novel framework for high-quality zero-shot singing voice conversion that operates efficiently in low-resource scenarios. This work represents a significant advancement in the field of audio processing, particularly in enhancing the quality and accessibility of singing voice conversion technologies.
The methodology presented in HQ-SVC is innovative, utilizing a decoupled codec for feature extraction and an Enhanced Voice Adaptation (EVA) module that integrates additional acoustic features. This approach addresses the shortcomings of existing zero-shot singing voice conversion methods by jointly modeling content and speaker features, which is a significant advancement. The use of differentiable signal processing and diffusion techniques for output refinement is also noteworthy, as it enhances the fidelity of the generated audio. However, the paper could benefit from more detailed descriptions of the implementation and the specific configurations used in the experiments.
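The pitch and volume cues that the paper reintroduces can be sketched with standard frame-level feature extraction, for example F0 via pYIN and RMS energy as below. These are off-the-shelf estimators with assumed ranges and hop sizes, not HQ-SVC's own modules.

```python
import numpy as np
import librosa

def pitch_volume_features(y: np.ndarray, sr: int = 22050, hop: int = 512):
    """Frame-level F0 and RMS volume, the kind of acoustic conditioning HQ-SVC
    reintroduces alongside decoupled content/speaker features."""
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, hop_length=hop,
    )
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    f0 = np.nan_to_num(f0, nan=0.0)            # unvoiced frames -> 0 Hz
    n = min(len(f0), len(rms))
    return np.stack([f0[:n], rms[:n]], axis=-1)

# Stand-in signal: 2 s of a 220 Hz tone instead of a real vocal recording.
sr = 22050
y = 0.4 * np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr).astype(np.float32)
feats = pitch_volume_features(y, sr=sr)
```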
The experiments are well-structured, comparing HQ-SVC against state-of-the-art methods in both zero-shot singing voice conversion and voice super-resolution tasks. The evaluation metrics used, including objective measures (STOI, SECS, F0 RMSE, etc.) and subjective ratings (NMOS, SMOS), provide a comprehensive assessment of the model's performance. The results demonstrate that HQ-SVC significantly outperforms existing methods in various metrics, which supports the claims made in the paper. However, the limited dataset size for training could raise concerns about the generalizability of the results.
The paper provides a GitHub link to the code, which is essential for reproducibility. However, the details regarding the training process, such as hyperparameters and data preprocessing steps, could be more explicitly stated to facilitate easier replication of the results by other researchers.
One limitation is the reliance on a relatively small dataset for training, which may affect the model's ability to generalize to a wider range of unseen speakers. Additionally, while the paper discusses the performance of HQ-SVC in low-resource scenarios, it does not provide an extensive analysis of how the model performs under different conditions or with varying levels of data quality.
The implications of this research are significant for fields such as music production, virtual singing applications, and audio processing in general. By enabling high-quality zero-shot singing voice conversion in low-resource settings, HQ-SVC could democratize access to advanced audio synthesis technologies, allowing more creators to leverage these tools without the need for extensive datasets or computational resources.
We challenge the conventional view of neural network pruning as solely a compression technique, demonstrating that one-shot magnitude pruning serves as a powerful implicit regularizer for ASR. Using Whisper-small, we combine gradient- and Fisher-based sensitivity diagnostics with targeted, component-wise pruning. This reveals architectural asymmetries: decoder FFNs are pruning-fragile, whereas decoder self-attention and the last encoder layers contain redundancy that, when removed, improves generalization. Without fine-tuning, pruning 50% of decoder self-attention reduces WER by 2.38% absolute (20.44% relative) on LibriSpeech test-other; pruning the last four encoder layers at 50% instead yields a 1.72% absolute (14.8% relative) improvement. Gains persisted on Common Voice and TED-LIUM datasets. Beyond regularization benefits, our sensitivity-aware approach enables more aggressive one-shot compression. At 40% sparsity, where established global pruning approaches catastrophically fail, our method preserves near-baseline accuracy. This positions pruning as a first-class architectural design tool: knowing where to prune is as important as how much to prune.
Primary: unknown
All Institutions: unknown
This paper presents a novel perspective on neural network pruning, demonstrating its potential as an implicit regularizer in ASR systems. The combination of sensitivity analysis and targeted pruning not only enhances model performance but also challenges traditional views on pruning as merely a compression technique.
The methodology presented in this paper is robust, combining both first-order and second-order sensitivity analyses to inform pruning decisions. The authors effectively leverage these analyses to identify which components of the Whisper-small model are more sensitive to pruning, thus allowing for targeted and effective pruning that serves as an implicit regularizer. The approach of treating pruning not merely as a compression technique but as a means to improve generalization is innovative and well-justified through empirical evidence. However, the paper could benefit from a more detailed discussion on the theoretical implications of pruning as regularization.
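The component-wise one-shot pruning described above can be sketched in a few lines with PyTorch's pruning utilities, targeting only the decoder self-attention projections of Whisper-small and leaving the pruning-fragile decoder FFNs untouched. The 50% amount matches the reported setting, but the authors' sensitivity-guided procedure and masking details may differ from this sketch.

```python
import torch.nn.utils.prune as prune
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One-shot 50% magnitude pruning of decoder self-attention projections only,
# leaving the pruning-fragile decoder FFNs untouched.
for layer in model.model.decoder.layers:
    for proj in (layer.self_attn.q_proj, layer.self_attn.k_proj,
                 layer.self_attn.v_proj, layer.self_attn.out_proj):
        prune.l1_unstructured(proj, name="weight", amount=0.5)
        prune.remove(proj, "weight")   # make the sparsity permanent (pruned weights set to zero)
```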
The experiments conducted are thorough, utilizing multiple datasets (LibriSpeech, Common Voice, and TED-LIUM) to validate the effectiveness of the proposed pruning method across different acoustic conditions. The results are compelling, showing significant improvements in WER without the need for fine-tuning, which is a notable contribution to the field of ASR. The authors provide clear metrics and comparisons against standard pruning methods, demonstrating the advantages of their sensitivity-aware approach.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details that would aid in reproducibility, such as hyperparameters used during pruning and the exact configurations of the Whisper-small model. Including a supplementary material or a link to a code repository would greatly enhance reproducibility.
One limitation of the study is the focus on a single model architecture (Whisper-small), which may limit the generalizability of the findings to other architectures. Additionally, the paper does not address potential trade-offs between pruning for regularization and the risk of underfitting, especially in more complex models. The empirical results, while promising, may not be universally applicable across all ASR systems.
The findings of this research have significant implications for the design of neural network architectures in ASR and potentially other domains. By positioning pruning as a first-class architectural design tool, the work encourages practitioners to rethink model optimization strategies, potentially leading to more efficient and robust models. This could have broader applications in real-time speech recognition systems, where both accuracy and computational efficiency are critical.
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of SpeechJudge, a comprehensive suite that includes a large-scale human feedback dataset, an evaluation benchmark, and a generative reward model aimed at improving the naturalness of speech synthesis. This work addresses a critical gap in the field and provides a solid foundation for future research and development in speech synthesis aligned with human preferences.
The methodology presented in the paper is robust, focusing on the construction of a large-scale human feedback dataset (SpeechJudge-Data) and a corresponding evaluation benchmark (SpeechJudge-Eval). The authors employ advanced zero-shot TTS models to generate diverse speech samples, which are then annotated for intelligibility and naturalness. The development of the SpeechJudge-GRM, a generative reward model, is particularly noteworthy as it utilizes a two-stage training process that combines supervised fine-tuning with reinforcement learning to improve alignment with human preferences. This innovative approach addresses a significant gap in the existing literature regarding the evaluation of speech naturalness.
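For reference, the classic Bradley-Terry baseline that SpeechJudge-GRM is compared against reduces to a simple pairwise loss on scalar rewards for the preferred and rejected sample of each annotated pair, as sketched below; the GRM's SFT-plus-GRPO recipe is not shown.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Pairwise preference loss: push the reward of the human-preferred (more natural)
    sample above that of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Stand-in scalar rewards for a batch of 16 annotated speech pairs.
loss = bradley_terry_loss(torch.randn(16), torch.randn(16))
```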
The experimental evaluation is thorough, with a clear focus on assessing the performance of various models against the SpeechJudge-Eval benchmark. The authors provide detailed results showing the limitations of existing metrics and models in judging speech naturalness, with the SpeechJudge-GRM outperforming traditional methods. The use of a substantial dataset (99K speech pairs) and the rigorous evaluation process lend credibility to the findings, showcasing the effectiveness of the proposed methods.
While the paper details the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility. For example, the absence of links to code repositories or datasets limits the ability of other researchers to replicate the study. Providing such resources would significantly enhance the reproducibility of the results.
One limitation of the study is the reliance on human annotators, which can introduce variability and bias in the annotations. Additionally, while the dataset is large, it may not fully capture the diversity of speech styles and contexts, potentially limiting the generalizability of the findings. The paper also does not address the computational resources required for training the generative reward model, which may be a barrier for some researchers.
The work has significant implications for the field of speech synthesis and natural language processing. By improving the alignment of speech synthesis models with human perceptions of naturalness, the research could enhance the quality of synthesized speech in applications such as virtual assistants, audiobooks, and language learning tools. The findings may also inspire further research into human-aligned models across different modalities.
Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of SpikCommander, a high-performance spiking transformer architecture that effectively models temporal dependencies in speech command recognition tasks. This work significantly advances the field of neuromorphic computing by addressing key challenges in SNNs and demonstrating superior performance compared to existing methods.
The paper introduces a novel architecture, SpikCommander, which integrates a multi-view spiking temporal-aware self-attention (MSTASA) module and a spiking contextual refinement MLP (SCR-MLP). This approach is innovative in addressing the limitations of existing spiking neural networks (SNNs) for speech command recognition by effectively modeling temporal dependencies through a multi-view learning framework. The architecture's design is well-justified and includes a detailed explanation of the spiking neuron model and the attention mechanisms employed, showcasing a comprehensive understanding of the challenges in SNNs.
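The spiking substrate underlying such architectures can be illustrated with a leaky integrate-and-fire neuron trained via a surrogate gradient, as in the sketch below. The decay, threshold, and rectangular surrogate are common defaults and assumptions here, not the paper's MSTASA module.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate gradient in the backward pass."""
    @staticmethod
    def forward(ctx, v, threshold=1.0):
        ctx.save_for_backward(v)
        ctx.threshold = threshold
        return (v >= threshold).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        surrogate = (torch.abs(v - ctx.threshold) < 0.5).float()  # box window around threshold
        return grad_out * surrogate, None

def lif_step(x, v, decay=0.5, threshold=1.0):
    """One leaky integrate-and-fire update: integrate input, spike, then reset by subtraction."""
    v = decay * v + x
    spike = SpikeFn.apply(v, threshold)
    return spike, v - spike * threshold

# Run a random 10-step input current through 32 neurons.
v = torch.zeros(32)
for t in range(10):
    spikes, v = lif_step(torch.randn(32), v)
```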
The authors conduct extensive experiments on three benchmark datasets (SHD, SSC, and GSC), demonstrating that SpikCommander outperforms state-of-the-art SNN approaches with fewer parameters and comparable time steps. The results are presented clearly, with thorough comparisons against existing methods, and include ablation studies that validate the effectiveness of the proposed components. The experimental design appears robust, although further exploration of additional datasets could enhance generalizability.
The paper provides a GitHub repository link for code availability, which is crucial for reproducibility. However, the implementation details could be more explicit, particularly regarding hyperparameter settings and training procedures, to facilitate easier reproduction of results by other researchers.
The paper does not address potential limitations in the generalizability of the proposed model across different types of speech commands or in real-world applications. Additionally, while the energy efficiency of SNNs is highlighted, practical deployment on neuromorphic hardware remains to be validated.
The research has significant implications for energy-efficient speech command recognition, particularly in resource-constrained environments. The advancements in SNNs could lead to more sustainable AI applications, especially in mobile and embedded systems.
The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and lacks scalability. This paper introduces SYNTTS-COMMANDS, a novel, multilingual voice command dataset entirely generated using state-of-the-art Text-to-Speech (TTS) synthesis. By leveraging the CosyVoice 2 model and speaker embeddings from public corpora, we created a scalable collection of English and Chinese commands. Extensive benchmarking across a range of efficient acoustic models demonstrates that our synthetic dataset enables exceptional accuracy, achieving up to 99.5% on English and 98% on Chinese command recognition. These results robustly validate that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. Our work directly addresses the data bottleneck in TinyML, providing a practical, scalable foundation for building private, low-latency, and energy-efficient voice interfaces on resource-constrained edge devices.
Primary: Unknown
All Institutions: Unknown
This paper presents SYNTTS-COMMANDS, a novel multilingual voice command dataset generated using TTS technology, demonstrating that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. The work addresses a critical data bottleneck in the field, paving the way for more efficient and inclusive voice-enabled applications on resource-constrained devices.
The methodology is robust, leveraging state-of-the-art TTS technology to generate a synthetic dataset that addresses the critical bottleneck in KWS systems. The use of the CosyVoice 2 model and speaker embeddings from established corpora is a strong choice, ensuring high-quality output. The rigorous quality validation process, including automated transcription and manual verification, enhances the credibility of the dataset. However, the methodology could benefit from a more detailed explanation of the selection criteria for the commands and the specific acoustic models used in the benchmarking.
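To make the generate-then-filter recipe concrete, here is a minimal sketch of such a pipeline, assuming generic `tts_fn` and `asr_fn` callables (stand-ins for a CosyVoice 2 wrapper and an off-the-shelf ASR model); the similarity threshold and record format are illustrative and do not reflect the paper's actual validation procedure, which also includes manual verification.

```python
import difflib
from pathlib import Path

def normalize(text: str) -> str:
    return "".join(ch.lower() for ch in text if ch.isalnum() or ch.isspace()).strip()

def build_command_set(commands, speaker_embeddings, tts_fn, asr_fn,
                      out_dir="synthetic_kws", min_similarity=0.9):
    """Generate TTS audio per (command, speaker) pair and keep only clips whose
    ASR transcript closely matches the prompt (automatic quality filter).

    tts_fn(text, speaker_embedding) -> waveform   (e.g. a CosyVoice 2 wrapper)
    asr_fn(waveform) -> transcript string         (any off-the-shelf ASR model)
    """
    Path(out_dir).mkdir(exist_ok=True)
    kept, rejected = [], []
    for cmd in commands:
        for spk_id, emb in speaker_embeddings.items():
            wav = tts_fn(cmd, emb)
            hyp = asr_fn(wav)
            score = difflib.SequenceMatcher(None, normalize(cmd), normalize(hyp)).ratio()
            record = {"command": cmd, "speaker": spk_id, "asr": hyp, "score": score}
            (kept if score >= min_similarity else rejected).append(record)
    return kept, rejected
```

In practice the rejected pool would then be spot-checked by hand, mirroring the automated-plus-manual validation described above.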
The experimental evaluation is comprehensive, demonstrating the dataset's effectiveness through extensive benchmarking across various acoustic models. The results, showing high accuracy rates for both English and Chinese commands, validate the dataset's utility. However, the paper could improve by including comparisons with existing datasets and discussing potential biases in the synthetic data.
The paper outlines a clear dataset construction process and evaluation methodology, which supports reproducibility. However, the lack of a publicly available dataset or code repository limits the ability for other researchers to replicate the study fully. Providing access to the dataset and the models used would significantly enhance reproducibility.
The primary limitation is the dataset's focus on in-class command recognition, which does not account for out-of-vocabulary inputs or environmental noise. This could affect real-world applicability. Additionally, while the dataset is multilingual, it currently only includes English and Chinese, which may limit its use in diverse linguistic contexts.
The introduction of a scalable, synthetic voice command dataset has significant implications for the development of on-device AI and TinyML applications. By reducing reliance on human-recorded data, it lowers costs and accelerates the development of multilingual voice interfaces, promoting inclusivity in voice technology. The potential for future expansion into other languages and command types could further enhance its impact.
Speech-to-Speech (S2S) models have shown promising dialogue capabilities, but their ability to handle paralinguistic cues (such as emotion, tone, and speaker attributes) and to respond appropriately in both content and style remains underexplored. Progress is further hindered by the scarcity of high-quality, expressive demonstrations. To address this, we introduce ParaS2S, a novel reinforcement learning (RL) framework for paralinguistic-aware S2S that evaluates and optimizes both content and speaking style directly at the waveform level. We first construct ParaS2SBench, a benchmark that comprehensively evaluates S2S model outputs for content and style appropriateness on diverse and challenging input queries. It scores the fitness of input-output pairs and aligns well with human judgments, serving as an automatic judge for model outputs. With this scalable scoring feedback, we enable the model to explore and learn from diverse unlabeled speech via Group Relative Policy Optimization (GRPO). Experiments show that existing S2S models fail to respond appropriately to paralinguistic attributes, performing no better than pipeline-based baselines. Our RL approach achieves an 11% relative improvement in the appropriateness of response content and style on ParaS2SBench over supervised fine-tuning (SFT), surpassing all prior models while requiring substantially fewer warm-up annotations than pure SFT.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of ParaS2S, a novel RL framework for enhancing spoken language models' ability to respond to paralinguistic cues, backed by a comprehensive benchmark for evaluation. This work significantly advances the field of speech-to-speech interaction by addressing a critical gap in existing methodologies and demonstrating substantial improvements in model performance.
The proposed methodology introduces a novel reinforcement learning framework, ParaS2S, which is a significant advancement in the domain of spoken language models. The construction of ParaS2SBench as a benchmark for evaluating content and style appropriateness is particularly noteworthy. By focusing on waveform-level optimization, the authors address a critical gap in existing S2S models that often overlook paralinguistic cues. The use of Group Relative Policy Optimization (GRPO) to enhance model learning from diverse unlabeled speech is innovative and demonstrates a solid understanding of reinforcement learning principles.
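For readers unfamiliar with GRPO, the core idea is a group-relative advantage: several responses are sampled per prompt and each reward is normalized by the group's mean and standard deviation, removing the need for a learned critic. A minimal sketch follows; the scores are made up, and the ParaS2SBench judge is only assumed to play the role of the reward function.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group Relative Policy Optimization advantage: for each prompt, a group of
    responses is sampled, and each reward is normalized by the group mean/std,
    so no value network is required. rewards: (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 spoken queries, 4 sampled waveform responses each, scored by a
# content/style judge such as ParaS2SBench (the numbers here are invented).
scores = torch.tensor([[0.2, 0.6, 0.5, 0.9],
                       [0.1, 0.1, 0.4, 0.3]])
print(grpo_advantages(scores))
```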
The experiments conducted are robust, showcasing a clear comparison between the proposed RL approach and traditional supervised fine-tuning methods. The reported 11% relative improvement in response appropriateness is significant and indicates that the proposed method not only outperforms existing models but also does so with fewer warm-up annotations. However, the paper could benefit from a more detailed description of the datasets used and the specific metrics employed for evaluation.
The paper lacks sufficient implementation details that would allow for full reproducibility of the results. While the methodology is described, the absence of specific hyperparameters, training procedures, and dataset access information limits the ability of other researchers to replicate the findings. Providing a code repository or supplementary materials would greatly enhance reproducibility.
One of the primary limitations of the study is the reliance on the constructed benchmark, ParaS2SBench, which may not encompass all possible paralinguistic attributes. Additionally, the performance improvements, while statistically significant, may not translate to all real-world applications, particularly in diverse linguistic and cultural contexts. The paper does not address potential biases in the training data or the benchmark itself.
The implications of this research are substantial, particularly in applications such as conversational agents, virtual assistants, and any technology that relies on nuanced human interaction. By improving the ability of S2S models to understand and replicate paralinguistic cues, the work could lead to more natural and effective human-computer interactions. However, ethical considerations regarding the use of such technology, especially in sensitive contexts, must be addressed.
This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, through rank analysis, we compare the rank characteristics of the Mel-spectrum with those of other common acoustic degradation factors and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and the target spectrum as the dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, in which subbands are divided following an uneven strategy and a convolutional-style attention module with large kernels is employed for efficient T-F contextual modeling. To enable single-step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and performance is further improved by combining target-related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out-of-distribution datasets. Quantitative and qualitative results show that, while enjoying fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC yields state-of-the-art performance over existing advanced GAN-, DDPM-, and flow-matching-based baselines with only 4 sampling steps, and consistent superiority is maintained with single-step inference.
Primary: Chinese Academy of Sciences
All Institutions: Chinese Academy of Sciences, University of Chinese Academy of Sciences, Tsinghua University, Tencent AI Lab
This paper presents a significant contribution to the field of neural vocoders by framing the vocoder task as an audio restoration problem and introducing innovative methodologies that enhance performance and efficiency. The comprehensive experiments and results validate the effectiveness of the proposed approach, making it a valuable addition to the literature on audio processing.
The paper introduces a novel approach to neural vocoding by framing it as an audio restoration problem, which is a significant shift from traditional methods. The use of the Schrödinger bridge framework for diffusion modeling is innovative and provides a fresh perspective on the vocoder task. The proposed subband-aware convolutional diffusion network (BridgeVoC) effectively leverages hierarchical prior knowledge in the time-frequency domain, which enhances the model's ability to reconstruct audio waveforms. The introduction of an omnidirectional distillation loss for single-step inference is also a noteworthy contribution, as it addresses the common challenge of information transfer in model distillation.
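One plausible reading of the range-space spectral (RSS) surrogate, assumed here rather than taken from the paper, is the projection of the target magnitude spectrum through the mel filterbank and its pseudo-inverse, which yields a low-rank "degraded" input for the bridge to start from. The sketch below illustrates that construction on a random stand-in spectrogram.

```python
import numpy as np
import librosa

def rss_surrogate(mag_spec: np.ndarray, sr: int = 22050,
                  n_fft: int = 1024, n_mels: int = 80) -> np.ndarray:
    """Project a linear magnitude spectrogram onto the range space of the mel
    filterbank: S_rss = M^+ (M S). The result lives in a rank-limited subspace,
    so it can be treated as a degraded version of the target spectrum.

    mag_spec: (n_fft // 2 + 1, frames) linear-frequency magnitudes.
    """
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_freq)
    mel_spec = mel_fb @ mag_spec                   # forward mel compression
    return np.linalg.pinv(mel_fb) @ mel_spec       # pseudo-inverse lift back

# Random stand-in for a target magnitude spectrogram (513 bins x 200 frames).
rng = np.random.default_rng(0)
S = np.abs(rng.standard_normal((513, 200)))
S_rss = rss_surrogate(S)
print(S.shape, S_rss.shape, np.linalg.matrix_rank(S_rss) <= 80)
```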
The experiments conducted are comprehensive, utilizing various benchmarks and out-of-distribution datasets, which strengthens the validity of the results. The quantitative and qualitative analyses demonstrate that BridgeVoC achieves state-of-the-art performance with fewer parameters and lower computational costs compared to existing models. The paper provides detailed metrics, including MCD, PESQ, and VISQOL, which are essential for evaluating audio quality.
The paper includes a demo link and mentions the availability of code, which is crucial for reproducibility. However, the details regarding the implementation, such as hyperparameters and training configurations, could be more explicitly stated to facilitate easier replication of the results by other researchers.
One limitation is the reliance on specific datasets for training and evaluation, which may affect the generalizability of the model. Additionally, while the single-step inference method shows promise, it may not achieve the same quality as multi-step methods in all scenarios, particularly in complex audio environments.
The proposed method has significant implications for real-time audio processing applications, such as speech synthesis and enhancement, where computational efficiency and audio quality are critical. By addressing the performance-inference dilemma, this work could lead to advancements in various fields, including telecommunications, entertainment, and assistive technologies.
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address production large language model (LLM)-based speech synthesis. While previous studies have considered protection against fine-tuning-based synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems that leverage automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. This E2E speech synthesis therefore also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to ensure that the perturbation remains imperceptible. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
Primary: The University of Hong Kong
All Institutions: The University of Hong Kong, Beijing University of Posts and Telecommunications, CSIRO’s Data61, National University of Singapore, Responsible AI Research (RAIR) Centre, Shenzhen International Graduate School, The University of Adelaide, Tsinghua University
The main contribution of this paper is the introduction of E2E-VGuard, a novel framework that proactively defends against adversarial attacks in production LLM-based speech synthesis, addressing both timbre and pronunciation vulnerabilities. This work is a substantial step forward in securing speech synthesis technologies, with implications for both academic research and practical applications in the field.
The methodology presented in E2E-VGuard is innovative, utilizing an encoder ensemble with a feature extractor to enhance the protection of timbre in speech synthesis. The incorporation of a psychoacoustic model to ensure perturbative imperceptibility is a significant advancement, addressing the challenge of maintaining audio quality while implementing security measures. The approach to countering ASR-targeted adversarial examples is particularly noteworthy, as it reflects a deep understanding of the vulnerabilities in current systems.
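As a rough sketch of ensemble-based protective perturbation (not the paper's implementation), the code below runs PGD-style updates that push the perturbed waveform's embedding away from the clean timbre under every encoder in an ensemble, with a simple L-infinity budget standing in for the psychoacoustic imperceptibility constraint; the optional `asr_loss_fn` hook gestures at the ASR-targeted component.

```python
import torch

def ensemble_voice_protect(wave, encoders, asr_loss_fn=None, eps=0.002,
                           alpha=5e-4, steps=50):
    """PGD-style protective perturbation: push the perturbed waveform's speaker
    embedding away from the clean one under every encoder in the ensemble,
    optionally adding an ASR-disruption term (asr_loss_fn returns a loss we
    want to increase). eps bounds the perturbation magnitude."""
    clean = wave.detach()
    with torch.no_grad():
        targets = [enc(clean) for enc in encoders]   # clean timbre embeddings
    delta = torch.zeros_like(clean, requires_grad=True)
    for _ in range(steps):
        adv = clean + delta
        # Minimize similarity to the clean timbre, averaged over encoders.
        loss = torch.stack([
            torch.nn.functional.cosine_similarity(enc(adv), tgt, dim=-1).mean()
            for enc, tgt in zip(encoders, targets)]).mean()
        if asr_loss_fn is not None:
            loss = loss - asr_loss_fn(adv)           # also degrade ASR output
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (clean + delta).detach()

# Toy usage with stand-in "encoders" (any module mapping waveform -> embedding).
encoders = [torch.nn.Sequential(torch.nn.Linear(16000, 192)) for _ in range(2)]
protected = ensemble_voice_protect(torch.randn(1, 16000), encoders)
print(protected.shape)
```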
The experimental evaluation is robust, involving 16 open-source synthesizers and 3 commercial APIs across diverse datasets in both Chinese and English. This breadth of testing enhances the credibility of the findings. The results demonstrate E2E-VGuard's effectiveness in protecting against both timbre and pronunciation attacks, which is critical for real-world applications. However, the paper could benefit from a more detailed discussion of the metrics used to evaluate effectiveness.
The authors provide a demo page and mention that the code is available, which supports reproducibility. However, the paper could improve by including more detailed implementation instructions and specific configurations used during experiments to facilitate independent verification of results.
One limitation is the potential dependency on the quality of the ASR systems used, as variations in ASR performance could impact the effectiveness of the proposed defenses. Additionally, the paper does not extensively discuss the scalability of the proposed solution in real-world scenarios or its performance against more sophisticated adversarial attacks.
The implications of this research are significant, as it addresses critical security concerns in speech synthesis technology, which is increasingly used in various applications, including virtual assistants and automated customer service. By providing a proactive defense framework, E2E-VGuard has the potential to enhance trust in voice-based systems and mitigate risks associated with voice-cloning fraud.
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training resource use. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model at substantially lower training and deployment resource use. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
Primary: Imperial College London
All Institutions: Imperial College London
The main contribution of this paper is the development of Omni-AVSR, a unified audio-visual LLM that effectively integrates multiple speech recognition modalities while optimizing for resource efficiency. This work significantly advances the field by addressing the limitations of existing models and proposing a novel methodology that balances performance and efficiency in multimodal speech recognition tasks.
The paper introduces a unified framework, Omni-AVSR, which leverages a matryoshka representation learning paradigm to facilitate multi-granularity training across ASR, VSR, and AVSR tasks. This approach is innovative as it aims to reduce resource consumption while maintaining performance, addressing a significant gap in the current literature where these tasks are treated independently. The use of LoRA-based strategies for parameter-efficient adaptation is also a notable contribution, allowing for shared and task-specific model specialization.
The experiments conducted on the LRS2 and LRS3 datasets are comprehensive, demonstrating that Omni-AVSR achieves comparable or superior accuracy to existing state-of-the-art models while significantly lowering resource use. The robustness of the model under acoustic noise and the analysis of scaling behavior as LLM size increases provide valuable insights into performance-efficiency trade-offs, enhancing the credibility of the results.
The paper provides a GitHub repository link for code access, which is essential for reproducibility. However, further details on the experimental setup, hyperparameters, and specific configurations used in the training process would enhance reproducibility.
One limitation is the reliance on fixed-rate token compression, which may still impose constraints on flexibility despite the proposed improvements. Additionally, the paper does not extensively discuss the potential challenges in deploying the unified model in real-world scenarios, particularly in diverse acoustic environments.
The unified approach to multimodal speech recognition has the potential to streamline applications in various fields, including human-computer interaction, accessibility technologies, and automated transcription services. By reducing resource requirements, it could lead to more efficient deployment of speech recognition systems in resource-constrained environments.
Classroom environments are particularly challenging for children with hearing impairments, where background noise, multiple talkers, and reverberation degrade speech perception. These difficulties are greater for children than adults, yet most deep learning speech separation models for assistive devices are developed using adult voices in simplified, low-reverberation conditions. This overlooks both the higher spectral similarity of children's voices, which weakens separation cues, and the acoustic complexity of real classrooms. We address this gap using MIMO-TasNet, a compact, low-latency, multi-channel architecture suited for real-time deployment in bilateral hearing aids or cochlear implants. We simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. Training strategies tested how well the model adapts to children's speech through spatial cues. Models trained on adult speech, classroom data, and finetuned variants were compared to assess data-efficient adaptation. Results show that adult-trained models perform well in clean scenes, but classroom-specific training greatly improves separation quality. Finetuning with only half the classroom data achieved comparable gains, confirming efficient transfer learning. Training with diffuse babble noise further enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances. These findings demonstrate that spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.
Primary: Radboud University, Donders Institute for Brain, Cognition, and Behaviour
All Institutions: Radboud University, Mortimer B. Zuckerman Mind, Brain, Behavior Institute, Columbia University, Department of Otorhinolaryngology, Leiden University Medical Centre, Leiden Institute for Brain and Cognition, Department of Bioelectronics, Delft University of Technology
This paper presents a significant advancement in the field of speech separation for hearing-impaired children, demonstrating the effectiveness of targeted adaptation strategies in complex auditory environments. The methodology is robust, and the results have the potential to inform the design of more effective hearing assistive devices.
The paper employs a well-structured methodology using the MIMO-TasNet architecture, which is adept for real-time speech separation in complex environments. The authors simulate realistic classroom conditions, incorporating dynamic talker movements and background noise, which is crucial for evaluating the model's performance in ecologically valid scenarios. The use of binaural room impulse responses (BRIRs) and head-related impulse responses (HRIRs) to capture spatial cues is a significant strength, allowing for a nuanced approach to speech separation that accounts for the unique acoustic properties of children's voices.
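To illustrate the kind of scene simulation described here, the sketch below convolves dry speech with binaural room impulse responses for two talker positions and adds babble at a chosen SNR; the random arrays stand in for real child speech, BRIRs, and classroom noise, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(dry, brir):
    """Convolve a dry mono source with a binaural room impulse response.
    dry: (samples,), brir: (2, ir_len) -> (2, samples) left/right channels."""
    return np.stack([fftconvolve(dry, brir[ch])[: len(dry)] for ch in range(2)])

def mix_classroom_scene(target, interferer, brir_target, brir_interferer,
                        babble, snr_db=5.0):
    """Two spatialized talkers plus diffuse babble noise at a chosen SNR.
    Returns the binaural mixture and the spatialized target (training label)."""
    t = spatialize(target, brir_target)
    i = spatialize(interferer, brir_interferer)
    n = min(t.shape[1], i.shape[1], babble.shape[1])
    t, i, babble = t[:, :n], i[:, :n], babble[:, :n]
    speech = t + i
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(babble ** 2) + 1e-12
    babble = babble * np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return speech + babble, t

# Toy usage with random stand-ins for speech, BRIRs, and 2-channel babble.
rng = np.random.default_rng(0)
fs = 16000
mix, ref = mix_classroom_scene(rng.standard_normal(fs), rng.standard_normal(fs),
                               rng.standard_normal((2, 2048)) * 0.05,
                               rng.standard_normal((2, 2048)) * 0.05,
                               rng.standard_normal((2, fs)), snr_db=5.0)
print(mix.shape, ref.shape)
```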
The experiments are rigorous, utilizing a comprehensive dataset that includes both child and adult speech under various conditions. The results demonstrate clear improvements in speech separation quality when models are trained on classroom-specific data compared to adult-only data. The fine-tuning approach is particularly noteworthy, as it shows that substantial performance gains can be achieved with limited additional data, which is critical in low-resource domains. Statistical significance is well-reported, enhancing the credibility of the findings.
The paper provides sufficient detail regarding the model architecture and training procedures, including the use of specific datasets and evaluation metrics. The availability of the code on GitHub further supports reproducibility, allowing other researchers to replicate the study and build upon its findings.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world classroom environments. Additionally, while the study addresses the challenges of speech separation for children, it does not explore the potential impact of different types of background noise beyond babble, which could further influence model performance.
The implications of this research are significant for the development of assistive technologies for hearing-impaired children. By improving speech accessibility in noisy classroom settings, the findings could enhance educational outcomes for children with hearing impairments, promoting inclusivity and better learning experiences.
Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.
Primary: University of Maryland
All Institutions: University of Maryland, Dolby Laboratories
The main contribution of this paper is the introduction of SPUR, a framework that effectively integrates spatial audio understanding into large audio-language models, enhancing their reasoning capabilities while preserving their general audio understanding. This work represents a significant step forward in bridging the gap between auditory scene understanding and high-level reasoning in machine learning.
The proposed SPUR framework introduces a novel approach to integrating spatial audio understanding into large audio-language models (LALMs) through a lightweight, plug-and-play spatial adapter. The methodology is well-structured, utilizing a First-Order Ambisonics (FOA) encoder to extract spatial features and a multimodal adapter to condition existing LALMs. The detailed description of the four-stage pipeline for feature extraction and the innovative use of spatial covariance modeling are commendable. The approach's emphasis on maintaining the core model's integrity while enhancing spatial reasoning is a significant methodological strength.
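A common way to derive listener-centric directional cues from FOA, shown below as an assumption about what such an encoder might consume rather than the paper's exact feature set, is the per-bin active intensity vector computed from the W, X, Y, Z STFT channels, from which azimuth and elevation follow directly.

```python
import numpy as np

def foa_intensity_features(foa_stft: np.ndarray, eps: float = 1e-8):
    """Active intensity vector per time-frequency bin from first-order Ambisonics.

    foa_stft: complex array (4, freq, time) ordered as (W, X, Y, Z).
    Returns (3, freq, time) unit direction vectors plus azimuth/elevation in
    radians -- simple direction-of-arrival cues a spatial adapter could ingest.
    """
    W, X, Y, Z = foa_stft
    intensity = np.stack([np.real(np.conj(W) * X),
                          np.real(np.conj(W) * Y),
                          np.real(np.conj(W) * Z)])            # (3, F, T)
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + eps
    direction = intensity / norm
    azimuth = np.arctan2(direction[1], direction[0])
    elevation = np.arcsin(np.clip(direction[2], -1.0, 1.0))
    return direction, azimuth, elevation

rng = np.random.default_rng(0)
foa = rng.standard_normal((4, 257, 100)) + 1j * rng.standard_normal((4, 257, 100))
direction, az, el = foa_intensity_features(foa)
print(direction.shape, az.shape, el.shape)
```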
The experiments conducted are thorough, with extensive ablation studies that validate the effectiveness of the SPUR framework. The introduction of the SPUR-Set dataset is a notable contribution, providing a robust benchmark for evaluating spatial reasoning in audio-language models. The results demonstrate significant improvements in spatial QA and multi-speaker attribution tasks, showcasing the practical impact of the proposed method. The comparison against existing models further strengthens the evaluation, although it would benefit from more diverse baseline comparisons.
The paper provides a clear outline of the experimental setup, including hyperparameters and training protocols, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability for external validation of results. Future work should consider releasing the implementation to foster community engagement and further research.
The paper acknowledges several limitations, including the reliance on FOA, which may constrain spatial resolution and directionality. The dataset's focus on controlled environments may also limit generalization to real-world scenarios. Additionally, the static nature of the embeddings does not account for dynamic listener movement, which could be a significant factor in spatial audio applications.
The SPUR framework has the potential to advance spatial audio understanding significantly, with applications in immersive media, AR/VR, and assistive technologies for individuals with visual impairments. By enabling more nuanced spatial reasoning in audio-language models, it could enhance user interaction in various contexts, from creative workflows to navigation in complex auditory environments.
Beamforming with desired directivity patterns using compact microphone arrays is essential in many audio applications. Directivity patterns achievable using traditional beamformers depend on the number of microphones and the array aperture. Generally, their effectiveness degrades for compact arrays. To overcome these limitations, we propose a neural directional filtering (NDF) approach that leverages deep neural networks to enable sound capture with a predefined directivity pattern. The NDF computes a single-channel complex mask from the microphone array signals, which is then applied to a reference microphone to produce an output that approximates a virtual directional microphone with the desired directivity pattern. We introduce training strategies and propose data-dependent metrics to evaluate the directivity pattern and directivity factor. We show that the proposed method: i) achieves a frequency-invariant directivity pattern even above the spatial aliasing frequency, ii) can approximate diverse and higher-order patterns, iii) can steer the pattern in different directions, and iv) generalizes to unseen conditions. Lastly, experimental comparisons demonstrate superior performance over conventional beamforming and parametric approaches.
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
The main contribution of this paper is the introduction of a novel neural directional filtering approach that effectively captures sound with controllable directivity patterns using a compact microphone array. This work significantly advances the field of audio processing by providing a robust solution to challenges faced in traditional beamforming methods, particularly in reverberant environments.
The proposed Neural Directional Filtering (NDF) method utilizes deep neural networks to compute a single-channel complex mask from microphone array signals, enabling the capture of sound with predefined directivity patterns. The methodology is innovative as it integrates a steerable mechanism and employs a batch-aggregated normalized L1 loss function for training, which enhances performance for higher-order patterns. The introduction of data-dependent metrics to evaluate directivity patterns and factors adds significant value to the methodology, allowing for a more nuanced understanding of the model's performance.
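The inference path described here (STFT the array, predict a single complex mask, apply it to the reference channel, and resynthesize) can be sketched as follows; `toy_mask_net` is a placeholder for the actual NDF network, and the STFT settings are illustrative.

```python
import torch

def apply_directional_mask(mics: torch.Tensor, mask_net, n_fft=512, hop=128,
                           ref_channel=0) -> torch.Tensor:
    """Neural directional filtering inference path: STFT the microphone array,
    predict one complex mask, multiply it onto the reference channel, and
    resynthesize. mask_net maps (channels, freq, frames) complex features to a
    (freq, frames) complex mask; here it is a stand-in, not the paper's model."""
    window = torch.hann_window(n_fft)
    specs = torch.stft(mics, n_fft, hop_length=hop, window=window,
                       return_complex=True)             # (channels, freq, frames)
    mask = mask_net(specs)                               # (freq, frames), complex
    filtered = specs[ref_channel] * mask
    return torch.istft(filtered, n_fft, hop_length=hop, window=window,
                       length=mics.shape[-1])

# Toy stand-in "network": average channel magnitudes into a normalized mask.
def toy_mask_net(specs):
    m = specs.abs().mean(dim=0)
    return (m / (m.max() + 1e-8)).to(specs.dtype)

mics = torch.randn(4, 16000)                             # 4 channels, 1 s @ 16 kHz
out = apply_directional_mask(mics, toy_mask_net)
print(out.shape)
```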
The experiments are comprehensive and well-structured, evaluating the NDF model in both anechoic and reverberant environments. The performance metrics used, including Signal-to-Distortion Ratio (SDR), Directivity Factor (DF), and power patterns, provide a robust framework for assessing the effectiveness of the proposed method. The results demonstrate that NDF consistently outperforms conventional beamforming techniques, particularly in challenging conditions, showcasing its potential for real-world applications.
The paper provides detailed descriptions of the experimental setup, including the array geometry, datasets, and training strategies. However, the lack of publicly available code or datasets limits reproducibility. The methodology is described in sufficient detail to allow for replication, but access to the actual implementation would enhance reproducibility significantly.
One notable limitation is the model's reliance on training data that assumes stationary sources, which may not generalize well to dynamic environments with moving sources. Additionally, the steerability of the model is constrained to directions encountered during training, which could limit its practical applicability in diverse scenarios. The paper also does not address the computational complexity of the model, which may be a concern for real-time applications.
The NDF approach has significant implications for various audio applications, including hearing aids, virtual reality, and immersive audio experiences. By enabling precise control over sound capture in complex environments, this research could enhance user experiences in consumer electronics and assistive technologies. The ability to generalize to unseen conditions also suggests potential for broader applications in real-world scenarios.
Beamforming with desired directivity patterns using compact microphone arrays is essential in many audio applications. Directivity patterns achievable using traditional beamformers depend on the number of microphones and the array aperture. Generally, their effectiveness degrades for compact arrays. To overcome these limitations, we propose a neural directional filtering (NDF) approach that leverages deep neural networks to enable sound capture with a predefined directivity pattern. The NDF computes a single-channel complex mask from the microphone array signals, which is then applied to a reference microphone to produce an output that approximates a virtual directional microphone with the desired directivity pattern. We introduce training strategies and propose data-dependent metrics to evaluate the directivity pattern and directivity factor. We show that the proposed method: i) achieves a frequency-invariant directivity pattern even above the spatial aliasing frequency, ii) can approximate diverse and higher-order patterns, iii) can steer the pattern in different directions, and iv) generalizes to unseen conditions. Lastly, experimental comparisons demonstrate superior performance over conventional beamforming and parametric approaches.
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
This paper presents a novel neural directional filtering approach that enhances sound capture capabilities using compact microphone arrays. The integration of deep learning with traditional beamforming techniques represents a significant advancement in audio processing, with promising implications for various applications in the field.
The methodology presented in this paper is innovative as it combines deep learning techniques with traditional beamforming approaches to create a neural directional filtering (NDF) system. The authors introduce a novel architecture that incorporates LSTM networks to process microphone array signals, allowing for the learning of complex directivity patterns. The proposed training strategies and evaluation metrics are well-defined, enabling a comprehensive analysis of the model's performance in both anechoic and reverberant environments. The steerability feature of the NDF is particularly noteworthy, as it allows for dynamic adjustments to the directivity pattern, which is a significant advancement over traditional fixed beamforming methods.
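One simple way to realize the steerability described here, offered as an assumption rather than the paper's architecture, is to embed the desired look direction and concatenate it with the per-frame array features before an LSTM mask estimator, so a single model can be conditioned on different target directions; all layer sizes below are illustrative.

```python
import torch
import torch.nn as nn

class SteerableMaskLSTM(nn.Module):
    """Illustrative steerable mask estimator: per-frame array features are
    concatenated with an embedding of the desired look direction, so one model
    can realize different directivity patterns at inference time."""
    def __init__(self, feat_dim, n_freq, hidden=256):
        super().__init__()
        self.dir_embed = nn.Linear(2, 16)          # (cos az, sin az) -> embedding
        self.lstm = nn.LSTM(feat_dim + 16, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2 * n_freq)  # real + imaginary mask parts

    def forward(self, feats, azimuth):
        # feats: (batch, frames, feat_dim); azimuth: (batch,) in radians
        d = self.dir_embed(torch.stack([torch.cos(azimuth), torch.sin(azimuth)], -1))
        d = d.unsqueeze(1).expand(-1, feats.shape[1], -1)
        h, _ = self.lstm(torch.cat([feats, d], dim=-1))
        re, im = self.head(h).chunk(2, dim=-1)
        return torch.complex(re, im)               # (batch, frames, n_freq) mask

model = SteerableMaskLSTM(feat_dim=4 * 257, n_freq=257)
mask = model(torch.randn(2, 100, 4 * 257), torch.tensor([0.0, 1.57]))
print(mask.shape)
```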
The experiments conducted are thorough and cover a wide range of scenarios, including both anechoic and reverberant conditions. The authors provide detailed descriptions of the datasets, training procedures, and evaluation metrics used to assess the performance of the NDF. The results demonstrate that the NDF outperforms conventional beamforming techniques, particularly in terms of achieving higher-order directivity patterns and maintaining frequency invariance. However, the paper could benefit from more extensive comparisons with additional state-of-the-art methods to further validate the claims.
The paper provides a solid foundation for reproducibility, detailing the training strategies, datasets, and model architectures used. However, the absence of publicly available code or datasets limits the ability for independent verification of results. Including a link to a repository or supplementary materials would enhance reproducibility significantly.
One limitation is the reliance on simulated environments for training and evaluation, which may not fully capture the complexities of real-world scenarios. Additionally, while the model shows promise in handling reverberation, its performance in highly dynamic environments with rapidly changing sound sources is not thoroughly explored. The steerability feature is limited to angles encountered during training, which may restrict its applicability in some contexts.
The proposed NDF has potential applications in various fields, including hearing aids, virtual reality, and smart audio devices, where directional sound capture is crucial. By improving the ability to filter and capture sound from specific directions, this work could enhance user experience in audio technologies and contribute to advancements in spatial audio processing.
Self-talk, an internal dialogue that can occur silently or be spoken aloud, plays a crucial role in emotional regulation, cognitive processing, and motivation, yet has remained largely invisible and unmeasurable in everyday life. In this paper, we present MutterMeter, a mobile system that automatically detects vocalized self-talk from audio captured by earable microphones in real-world settings. Detecting self-talk is technically challenging due to its diverse acoustic forms, semantic and grammatical incompleteness, and irregular occurrence patterns, which differ fundamentally from assumptions underlying conventional speech understanding models. To address these challenges, MutterMeter employs a hierarchical classification architecture that progressively integrates acoustic, linguistic, and contextual information through a sequential processing pipeline, adaptively balancing accuracy and computational efficiency. We build and evaluate MutterMeter using a first-of-its-kind dataset comprising 31.1 hours of audio collected from 25 participants. Experimental results demonstrate that MutterMeter achieves robust performance with a macro-averaged F1 score of 0.84, outperforming conventional approaches, including LLM-based and speech emotion recognition models.
Primary: Korea University of Technology and Education
All Institutions: Korea University of Technology and Education
The main contribution of this paper is the introduction of MutterMeter, a novel system for automatic self-talk detection using earable technology. This work represents a meaningful advancement in the intersection of machine learning, audio processing, and human-centered computing, addressing a previously underexplored area with practical implications for emotional and cognitive health.
The methodology presented in this paper is innovative, employing a hierarchical classification architecture that integrates acoustic, linguistic, and contextual information. This approach is particularly well-suited for the complexities of self-talk detection, which is inherently different from traditional speech recognition tasks. The authors provide a clear explanation of their sequential processing pipeline, which balances accuracy and computational efficiency. However, the paper could benefit from more detailed descriptions of the specific algorithms used within each classification layer and how they adaptively adjust to the varying contexts of self-talk.
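A minimal sketch of such a confidence-gated cascade is shown below: a cheap acoustic stage runs first, and costlier linguistic and contextual stages are invoked only when the current stage is uncertain. The stages, thresholds, and return format are placeholders, not MutterMeter's actual classifiers.

```python
def hierarchical_self_talk_detector(segment, stages, thresholds):
    """Confidence-gated cascade in the spirit of a hierarchical pipeline:
    each stage returns (probability_of_self_talk, confidence); later, costlier
    stages run only when the current stage is not confident enough.

    stages: list of callables, cheapest first (e.g. acoustic -> linguistic ->
    contextual); thresholds: confidence needed to stop at each stage.
    """
    decision = None
    for stage, tau in zip(stages, thresholds):
        prob, confidence = stage(segment)
        decision = prob >= 0.5
        if confidence >= tau:          # confident enough: exit early
            return decision, stage.__name__
    return decision, "final_stage"

# Toy stages standing in for real acoustic/linguistic/contextual classifiers.
def acoustic_stage(seg):    return 0.62, 0.55
def linguistic_stage(seg):  return 0.81, 0.90
def contextual_stage(seg):  return 0.70, 0.99

print(hierarchical_self_talk_detector(
    "audio-segment-placeholder",
    [acoustic_stage, linguistic_stage, contextual_stage],
    thresholds=[0.8, 0.8, 0.0]))
```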
The experimental evaluation is robust, utilizing a unique dataset of 31.1 hours of audio from 25 participants, which is a significant contribution to the field. The reported macro-averaged F1 score of 0.84 indicates strong performance compared to existing models. However, the paper lacks a thorough comparison with a wider range of baseline models beyond LLM-based and speech emotion recognition models, which could provide a more comprehensive view of its effectiveness.
The paper does not provide sufficient details regarding the implementation of the MutterMeter system, such as the specific configurations of the models used, the preprocessing steps for the audio data, or the evaluation metrics beyond the F1 score. This lack of detail may hinder reproducibility for other researchers looking to replicate or build upon this work.
One limitation noted is the relatively small sample size of 25 participants, which may not fully capture the diversity of self-talk across different demographics or contexts. Additionally, the system's performance in noisy environments or with varied accents and speech patterns is not addressed, which could impact its real-world applicability.
The potential applications of this research are significant, particularly in mental health and personal development domains. Automatic self-talk detection could provide insights into emotional regulation and cognitive processes, offering new tools for therapists and coaches. Furthermore, the technology could be integrated into wearable devices, enhancing user awareness of their internal dialogues and promoting mindfulness.
With the rise of voice-enabled technologies, loudspeaker playback has become widespread, posing increasing risks to speech privacy. Traditional eavesdropping methods often require invasive access or line-of-sight, limiting their practicality. In this paper, we present mmSpeech, an end-to-end mmWave-based eavesdropping system that reconstructs intelligible speech solely from vibration signals induced by loudspeaker playback, even through walls and without prior knowledge of the speaker. To achieve this, we reveal an optimal combination of vibrating material and radar sampling rate for capturing high-quality vibrations using narrowband mmWave signals. We then design a deep neural network that reconstructs intelligible speech from the estimated noisy spectrograms. To further support downstream speech understanding, we introduce a synthetic training pipeline and selectively fine-tune the encoder of a pre-trained ASR model. We implement mmSpeech with a commercial mmWave radar and validate its performance through extensive experiments. Results show that mmSpeech achieves state-of-the-art speech quality and generalizes well across unseen speakers and various conditions.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University
The main contribution of this paper is the development of mmSpeech, an innovative eavesdropping system that utilizes mmWave radar to reconstruct speech from vibrations, showcasing significant advancements in the field of audio processing and security. The methodology and experimental validation present a strong case for the system's effectiveness, although the paper would benefit from improved reproducibility and ethical considerations.
The paper presents a novel approach to eavesdropping using mmWave radar technology, which is a significant advancement over traditional methods that require line-of-sight or invasive techniques. The authors detail an optimal combination of materials and radar sampling rates for capturing vibrations, which is a critical aspect of their methodology. The deep neural network designed for reconstructing speech from noisy spectrograms is well-justified, and the introduction of a synthetic training pipeline to enhance the model's performance is a commendable innovation. However, the paper could benefit from a more detailed explanation of the model architecture and training process.
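For context, the standard vibrometry step underlying this class of systems recovers displacement from the unwrapped phase of the complex return at the target range bin, roughly x = lambda * phi / (4 * pi); the sketch below demonstrates that relationship on a synthetic 200 Hz vibration. The wavelength, chirp rate, and preprocessing are illustrative assumptions, not mmSpeech's actual pipeline.

```python
import numpy as np

def vibration_from_radar(iq_range_bin: np.ndarray, wavelength_m: float = 0.0039,
                         frame_rate_hz: float = 10000.0):
    """Recover a displacement signal from the slow-time complex samples of one
    range bin: displacement = wavelength * unwrapped_phase / (4 * pi).
    A 77 GHz radar gives a wavelength of about 3.9 mm; the sampling rate is the
    chirp (frame) rate. Returns displacement in metres plus a time axis."""
    phase = np.unwrap(np.angle(iq_range_bin))
    phase = phase - phase.mean()                        # remove static offset
    displacement = wavelength_m * phase / (4.0 * np.pi)
    t = np.arange(len(displacement)) / frame_rate_hz
    return displacement, t

# Toy example: a 200 Hz surface vibration with 1 micrometre amplitude.
fs, f0, amp = 10000.0, 200.0, 1e-6
t = np.arange(int(fs)) / fs
true_disp = amp * np.sin(2 * np.pi * f0 * t)
iq = np.exp(1j * 4 * np.pi * true_disp / 0.0039)        # ideal phase modulation
est, _ = vibration_from_radar(iq, wavelength_m=0.0039, frame_rate_hz=fs)
print(np.max(np.abs(est - (true_disp - true_disp.mean()))) < 1e-9)
```

The resulting displacement signal would then be turned into a spectrogram and handed to the reconstruction network described above.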
The experiments conducted are extensive and validate the system's performance across various conditions and unseen speakers. The results indicating state-of-the-art speech quality are promising, but the paper lacks a comparative analysis with existing eavesdropping systems, which would strengthen the claims of superiority. The datasets used for training and evaluation are not clearly specified, which raises questions about the reproducibility of the results.
The implementation details are somewhat vague, and there is no mention of code availability or supplementary materials that would allow for replication of the experiments. This is a significant limitation, as reproducibility is a cornerstone of scientific research.
One major limitation is the potential ethical implications of the technology, as it could be misused for malicious purposes. Additionally, the reliance on specific materials and conditions may limit the generalizability of the results. The paper could also explore the limitations of the radar technology in different environments more thoroughly.
The implications of this research are profound, particularly in the context of privacy and security in voice-enabled technologies. The ability to eavesdrop through walls raises significant ethical concerns and could lead to discussions about regulations and safeguards in the deployment of such technologies.