Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at $\sim$25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. \textbf{Project Page:} \href{https://omniforcing.com}{https://omniforcing.com}
Primary: Peking University
All Institutions: JD Explore Academy, Fudan University, Peking University, The University of Hong Kong
The paper presents OmniForcing, a novel framework for real-time joint audio-visual generation that effectively addresses the challenges of latency and training instability in existing models. Its innovative methodologies and comprehensive experimental evaluations position it as a significant contribution to the field of machine learning and multimedia generation.
The proposed OmniForcing framework is a significant advancement in real-time joint audio-visual generation, addressing the latency issues of existing models through innovative techniques such as Asymmetric Block-Causal Alignment and Audio Sink Tokens with Identity RoPE. The methodology is well-structured, with a clear focus on overcoming the challenges of temporal asymmetry and training instability in dual-stream architectures. The introduction of a Joint Self-Forcing Distillation paradigm is particularly noteworthy, as it allows the model to dynamically correct cross-modal errors, enhancing the robustness of the generation process.
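Neither the abstract nor this summary gives implementation details, so the sketch below is only an illustration of the general idea behind a block-causal attention pattern with an always-visible global prefix; the block size, prefix length, and mask layout are assumptions, not the authors' code.

```python
import numpy as np

def block_causal_mask(num_tokens: int, block_size: int, prefix_len: int) -> np.ndarray:
    """Boolean attention mask: True = attention allowed.

    Tokens 0..prefix_len-1 form a global prefix visible to every position.
    Remaining tokens are grouped into blocks; a query attends to the prefix,
    to all earlier blocks, and to every token inside its own block.
    """
    mask = np.zeros((num_tokens, num_tokens), dtype=bool)
    mask[:, :prefix_len] = True  # the global prefix is always visible
    for q in range(prefix_len, num_tokens):
        q_block = (q - prefix_len) // block_size
        # last key index visible to q: the end of q's own block (block-causal, not strictly causal)
        last_visible = prefix_len + (q_block + 1) * block_size
        mask[q, prefix_len:min(last_visible, num_tokens)] = True
    return mask

if __name__ == "__main__":
    print(block_causal_mask(num_tokens=10, block_size=3, prefix_len=1).astype(int))
```

A mask of this shape is what typically makes a rolling KV-cache possible at inference time, since keys from completed blocks are reused as-is and never revisited bidirectionally.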
The experiments are comprehensive, comparing OmniForcing against both bidirectional models and cascaded autoregressive baselines. The evaluation metrics are well-defined, focusing on visual quality, audio fidelity, and real-time inference efficiency. The results demonstrate that OmniForcing achieves state-of-the-art performance, significantly reducing latency while maintaining high-quality outputs, which is crucial for real-time applications.
The paper provides thorough implementation details, including training setups and hyperparameters, which enhances reproducibility. However, the lack of a publicly available code repository or demo may hinder independent verification of results.
One limitation is the inherent trade-off between streaming capability and the full-sequence attention of the original bidirectional model, which may lead to slight reductions in consistency and synchrony compared to the teacher model. Additionally, the reliance on a specific architecture (LTX-2) may limit the generalizability of the findings to other models.
The work has significant implications for real-time applications in multimedia content creation, gaming, and interactive media, where low-latency audio-visual generation is essential. By enabling efficient streaming of synchronized audio and video, OmniForcing could facilitate advancements in various fields, including virtual reality and live performance technologies.
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
Primary: Tel Aviv University
All Institutions: Tel Aviv University
The main contribution of this paper is the introduction of ID-LoRA, a unified framework for audio-video personalization that allows for joint generation of visual and auditory content based on text prompts, significantly advancing the capabilities of generative models in the multimedia domain. The technical contributions, particularly in methodology and experimental validation, position this work as a notable advancement in the field of machine learning and generative media.
The proposed ID-LoRA framework innovatively integrates audio and video generation through a unified model, addressing the limitations of existing cascaded approaches. The introduction of negative temporal positions and identity guidance are significant methodological advancements that enhance the model's ability to preserve speaker identity while allowing for flexible text-based control over both audio and visual outputs. The use of a joint audio-video diffusion backbone (LTX-2) is a strong choice, leveraging the latest advancements in diffusion models to achieve high-quality generative results.
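To make the two mechanisms concrete, here is a minimal, hypothetical sketch: identity guidance contrasts predictions with and without the reference (a classifier-free-guidance-style extrapolation), and reference tokens are placed on a disjoint negative temporal axis while generated frames keep the usual non-negative positions. Function names, the guidance scale, and the position layout are illustrative assumptions, not ID-LoRA's actual implementation.

```python
import numpy as np

def identity_guidance(pred_with_ref: np.ndarray,
                      pred_without_ref: np.ndarray,
                      scale: float = 2.0) -> np.ndarray:
    """CFG-style contrast between the prediction conditioned on the identity
    reference and the same prediction without it; scale > 1 amplifies
    reference-specific (speaker/appearance) features."""
    return pred_without_ref + scale * (pred_with_ref - pred_without_ref)

def reference_positions(num_ref_frames: int, num_gen_frames: int):
    """Hypothetical temporal-position assignment: reference tokens occupy a
    disjoint negative range (-num_ref_frames .. -1) while keeping their internal
    order; generated frames use the usual non-negative positions 0..N-1."""
    ref = np.arange(-num_ref_frames, 0)
    gen = np.arange(num_gen_frames)
    return ref, gen

if __name__ == "__main__":
    ref_pos, gen_pos = reference_positions(4, 8)
    print(ref_pos, gen_pos)
    x = np.zeros(3)
    print(identity_guidance(x + 1.0, x, scale=2.0))
```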
The experiments are robust, featuring a comprehensive evaluation protocol that includes both automatic metrics and human preference studies. The paper demonstrates significant improvements over state-of-the-art models, including commercial solutions, in speaker similarity and lip synchronization. The use of multiple datasets (CelebV-HQ and TalkVid) and the careful construction of evaluation splits (easy and hard) provide a thorough assessment of the model's performance across different conditions. The human evaluation metrics, including A/B preference tests and Mean Opinion Scores, add credibility to the findings.
The paper provides detailed implementation information, including training parameters, dataset preprocessing, and evaluation metrics, which enhances reproducibility. However, the reliance on specific datasets and the mention of proprietary models (like Kling 2.6 Pro) could pose challenges for complete reproducibility in diverse contexts.
While the model shows strong performance, it may still struggle with extreme variations in acoustic environments or highly dynamic visual scenes that were not extensively tested. The model's reliance on a relatively small training dataset (~3K pairs) raises questions about its generalization capabilities in broader applications. Additionally, the ethical implications of generating realistic audio-visual content without consent are not fully addressed.
The potential applications of ID-LoRA are significant, including personalized content creation, multilingual dubbing, and accessibility tools. However, the technology also poses risks related to misuse, such as deepfakes and unauthorized impersonation. The authors acknowledge these risks and suggest mitigations, emphasizing the need for ethical considerations in deployment.
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradigm that bypasses the acoustic stage by decoding linguistic intent directly from the neuro-muscular-articulatory continuum. This review provides a high-level synthesis of the SSI landscape, transitioning from traditional transducer-centric analysis to a holistic intent-to-execution taxonomy. We systematically evaluate sensing modalities across four critical physiological interception points: neural oscillations, neuromuscular activation, articulatory kinematics (ultrasound/magnetometry), and pervasive active probing via acoustic or radio-frequency sensing. Critically, we analyze the current paradigm shift from heuristic signal processing to Latent Semantic Alignment. In this new era, Large Language Models (LLMs) and deep generative architectures serve as high-level linguistic priors to resolve the ``informational sparsity'' and non-stationarity of biosignals. By mapping fragmented physiological gestures into structured semantic latent spaces, modern SSI frameworks have, for the first time, approached the Word Error Rate usability threshold required for real-world deployment. We further examine the transition of SSIs from bulky laboratory instrumentation to ``invisible interfaces'' integrated into commodity-grade wearables, such as earables and smart glasses. Finally, we outline a strategic roadmap addressing the ``user-dependency paradox'' through self-supervised foundation models and define the ethical boundaries of ``neuro-security'' to protect cognitive liberty in an increasingly interfaced world.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology, Hunan Normal University, Hunan University
The paper provides a comprehensive taxonomy and systematic review of Silent Speech Interfaces, highlighting the transition from traditional acoustic-based systems to innovative modalities leveraging Large Language Models. This work is significant as it outlines the potential of SSIs to enhance communication for diverse populations while addressing critical ethical considerations in the field.
The paper presents a comprehensive taxonomy of Silent Speech Interfaces (SSIs), detailing various sensing modalities and their physiological interception points. It transitions from traditional signal processing to modern approaches utilizing Large Language Models (LLMs) for semantic alignment. The methodology is robust, integrating diverse sensing techniques and advanced machine learning architectures, including deep generative models and self-supervised learning. However, the paper could benefit from more empirical validation of the proposed frameworks and a clearer delineation of the methodologies employed in existing studies.
While the paper provides a thorough review of existing literature and benchmarks, it lacks original experimental results or novel datasets. The comprehensive analysis of existing benchmarks, including performance metrics and comparison across modalities, is commendable. However, the absence of new empirical data limits the ability to assess the practical effectiveness of the proposed frameworks.
The paper does not provide specific implementation details or code repositories, which raises concerns about reproducibility. Although it discusses various methodologies and benchmarks, the lack of a clear path for others to replicate the findings diminishes the overall impact of the work.
The review primarily synthesizes existing literature without introducing new experimental findings. Additionally, while it addresses the ethical implications of SSIs, the discussion could be expanded to include more concrete examples of potential misuse or societal impacts. The paper also does not provide a detailed roadmap for future research, which could guide subsequent studies in this rapidly evolving field.
The potential applications of SSIs are significant, ranging from assistive technologies for individuals with speech impairments to secure communication in sensitive environments. The integration of LLMs into SSI frameworks could revolutionize human-computer interaction, making it more inclusive and privacy-preserving. However, the ethical considerations surrounding neuro-security and cognitive liberty must be carefully addressed to prevent misuse.
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12\% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST)
The paper presents a novel training framework, RAF, that enhances GAN vocoders' performance in speech synthesis by leveraging self-supervised learning and relativistic pairing. This work significantly contributes to the field by addressing the limitations of existing training objectives and demonstrating broad applicability across various datasets and vocoder architectures.
The paper introduces a novel training framework called Relativistic Adversarial Feedback (RAF) for GAN-based vocoders, which significantly enhances both in-domain fidelity and generalization to unseen scenarios. The methodology effectively integrates self-supervised learning models to assist discriminators in evaluating sample quality, promoting richer representations in the generator. The use of relativistic pairing for real and fake waveforms is a key innovation that allows for improved modeling of the training data distribution. The framework is well-structured, with clear definitions of the quality gap and discriminator gap, and the adversarial training objective is robustly formulated.
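The abstract does not reproduce RAF's exact objective, but the relativistic pairing it builds on is standard: the discriminator scores each real waveform relative to a paired fake rather than in isolation. A minimal sketch follows; the logit shapes and pairing strategy are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    """Discriminator: paired real samples should score higher than their fakes.
    softplus(-x) == -log(sigmoid(x)), so this is -log sigmoid(C(x_real) - C(x_fake))."""
    return F.softplus(-(real_logits - fake_logits)).mean()

def relativistic_g_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    """Generator: push fake scores above the paired real scores."""
    return F.softplus(-(fake_logits - real_logits)).mean()

if __name__ == "__main__":
    real = torch.randn(8, 1)
    fake = torch.randn(8, 1)
    print(relativistic_d_loss(real, fake).item(), relativistic_g_loss(real, fake).item())
```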
The experiments are comprehensive, utilizing multiple datasets to validate the effectiveness of RAF across various GAN-based vocoders. The results demonstrate consistent performance improvements in both objective and subjective metrics, with RAF-trained models outperforming traditional LSGAN models in perceptual quality while using fewer parameters. The inclusion of ablation studies strengthens the evaluation, providing insights into the contributions of different components of the RAF framework.
The authors provide a link to the source code for reproducing results, which is a positive aspect for ensuring reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameter choices and training configurations, to facilitate easier replication by other researchers.
The paper acknowledges limitations regarding the computational costs associated with training RAF due to the use of long segments and heavy SSL models. Additionally, the authors do not explore lightweight alternatives or provide rigorous theoretical explanations for the convergence of RAF. There are also ethical considerations regarding the potential misuse of realistic audio deepfakes generated by the framework.
The proposed RAF framework has significant implications for the field of speech synthesis and neural vocoding, particularly in enhancing the quality and generalization capabilities of GAN-based models. The integration of self-supervised learning models opens avenues for further research in resource-efficient settings and could contribute to advancements in applications such as text-to-speech and voice conversion systems.
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language-Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching-based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model, \textbf{Resonate}, establishes a new SOTA on TTA-Bench in terms of both audio quality and semantic alignment.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, SJTU Paris Elite Institute of Technology, X-LANCE Lab
The main contribution of this paper is the introduction of Resonate, a novel text-to-audio generator that employs online reinforcement learning and LALMs to achieve state-of-the-art performance in audio quality and semantic alignment. This work represents a significant step forward in the field of generative audio models, combining innovative methodologies with rigorous experimental validation to address existing limitations in TTA generation.
The paper presents a novel integration of online reinforcement learning (RL) into text-to-audio (TTA) generation, specifically through the Group Relative Policy Optimization (GRPO) framework. This approach addresses the limitations of offline RL methods by enabling more dynamic and responsive training that aligns better with human preferences. The use of Large Audio Language Models (LALMs) as reward models is particularly innovative, as it allows for fine-grained feedback that enhances the model's performance. The architecture leverages a Flux-style flow Transformer, which is well-suited for the generative tasks at hand. Overall, the methodology is robust, well-structured, and demonstrates a clear advancement over previous techniques.
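For readers unfamiliar with GRPO, the core step is group-relative advantage estimation: several samples are drawn per prompt, scored by the reward model (here, LALM-derived rewards), and each sample's advantage is its reward standardized within its own group. The sketch below shows only that normalization step; how the advantages are folded into the Flow Matching update is not detailed in the abstract and is omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages. `rewards` has shape (num_prompts, group_size), where
    each row holds the rewards of the group of samples drawn for one prompt.
    Each sample's advantage is its reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

if __name__ == "__main__":
    r = torch.tensor([[0.2, 0.9, 0.5, 0.4],
                      [0.1, 0.1, 0.8, 0.3]])
    print(group_relative_advantages(r))
```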
The experiments are comprehensive, utilizing a large-scale audio-text dataset for pre-training and a well-defined evaluation benchmark (TTA-Bench) for assessing model performance. The results indicate that the proposed Resonate model achieves state-of-the-art performance in both audio quality and semantic alignment, outperforming existing models across various metrics. The inclusion of both objective and subjective evaluation methods strengthens the findings, providing a balanced view of the model's capabilities. The ablation studies further validate the effectiveness of the proposed methods, highlighting the advantages of online RL and LALM-based rewards.
The authors have provided clear details regarding the model architecture, training procedures, and evaluation metrics, which supports reproducibility. The availability of code and model weights on GitHub enhances this aspect, allowing other researchers to replicate the study and build upon the work. However, specific hyperparameter settings and the rationale behind certain design choices could be elaborated further to aid in complete reproducibility.
One limitation is the reliance on the quality of the datasets used for training and evaluation, which may affect the generalizability of the results. Additionally, while the model achieves state-of-the-art performance, the computational efficiency and scalability of the approach in real-world applications could be further explored. The paper does not address potential biases in the training data or the implications of using LALMs as reward models.
The advancements presented in this paper have significant implications for various applications, including automated content creation in filmmaking, gaming, and virtual reality. The integration of RL and LALMs in TTA generation could lead to more intuitive and human-aligned audio synthesis, enhancing user experiences across multimedia platforms. Furthermore, the open-sourcing of the model and code promotes collaboration and innovation within the research community.
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have recently gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex-valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block-matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real-valued baselines, and that its block-matrix scheme reduces training time by 25%. Audio samples and code are available at https://hs-oh-prml.github.io/ComVo/.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of ComVo, a complex-valued neural vocoder that enhances waveform generation through a novel adversarial framework and efficient computational techniques. This work represents a significant advancement in the modeling of complex spectrograms, with the potential to improve audio synthesis quality and efficiency in various applications.
The paper presents a novel approach to waveform generation using complex-valued neural networks (CVNNs) in an adversarial framework. The introduction of phase quantization as a structured nonlinearity and the block-matrix computation scheme for efficiency are significant contributions. The methodology effectively integrates complex arithmetic into both the generator and discriminator, allowing for better modeling of the inherent structure of complex spectrograms. The design choices are well-justified, and the paper provides a clear rationale for the benefits of using CVNNs over traditional real-valued networks.
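Phase quantization as described amounts to snapping each complex spectrogram value's phase to a discrete grid while preserving its magnitude. A minimal sketch follows; the number of quantization levels and where the operation sits inside the network are assumptions, not ComVo's reported configuration.

```python
import numpy as np

def quantize_phase(z: np.ndarray, num_levels: int = 16) -> np.ndarray:
    """Keep each complex value's magnitude but snap its phase to one of
    `num_levels` evenly spaced angles."""
    step = 2 * np.pi / num_levels
    q_phase = np.round(np.angle(z) / step) * step
    return np.abs(z) * np.exp(1j * q_phase)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.standard_normal(4) + 1j * rng.standard_normal(4)
    print(spec)
    print(quantize_phase(spec, num_levels=8))
```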
The experiments are thorough, comparing the proposed ComVo model against several strong baselines using both subjective (MOS, SMOS, CMOS) and objective metrics (PESQ, MR-STFT error). The results demonstrate that ComVo consistently outperforms real-valued vocoders, achieving higher synthesis quality and reduced training time. The use of diverse datasets and evaluation metrics strengthens the validity of the findings, and the inclusion of qualitative analyses through Grad-CAM visualizations adds depth to the evaluation.
The paper provides sufficient implementation details, including architecture specifications and training setups, which facilitate reproducibility. The availability of audio samples and code on the provided demo URL further supports this aspect. However, the paper could benefit from a more detailed description of hyperparameter tuning and the specific configurations used for each baseline model.
While the paper acknowledges the higher memory footprint associated with complex-valued parameters, it does not explore potential optimizations for multi-GPU training setups, which could enhance scalability. Additionally, the reliance on split-style designs may limit the flexibility of the model, and future work is needed to explore more advanced architectures.
The integration of CVNNs into waveform generation has the potential to significantly advance the field of speech synthesis and audio processing. By improving the quality of generated audio and reducing computational costs, this work could facilitate the development of more efficient and effective neural vocoders, impacting applications in text-to-speech systems, music generation, and other audio-related technologies.
Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AnimeScore, a novel preference-based framework for evaluating anime-like speech, which combines extensive data collection, acoustic analysis, and advanced ranking models to provide a practical and objective metric for a previously subjective evaluation task. This work significantly advances the field of audio processing and speech synthesis by addressing a unique challenge in evaluating stylistic voice attributes.
The methodology is robust, employing a preference-based framework that collects a substantial dataset of 15,000 pairwise judgments to evaluate 'anime-like' speech. The authors effectively address the challenges of subjective evaluation by utilizing pairwise comparisons, which is more reliable for style-centric attributes. The acoustic analysis is comprehensive, identifying key features that contribute to perceived anime-likeness. The integration of self-supervised learning (SSL) models for ranking demonstrates a forward-thinking approach to automatic evaluation, enhancing the practicality of the framework.
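The paper's pairwise-ranking setup fits the standard Bradley-Terry formulation: the ranking model emits a scalar anime-likeness score per clip and is trained so the annotator-preferred clip in each pair scores higher. A minimal sketch of that loss follows; the authors' exact objective and SSL feature extraction are not shown here and the formulation is an assumption.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss on pairwise judgments:
    loss = -log sigmoid(s_preferred - s_other), averaged over pairs."""
    return F.softplus(-(score_preferred - score_other)).mean()

if __name__ == "__main__":
    s_win = torch.tensor([1.2, 0.3, 0.8])    # scores of the clips annotators preferred
    s_lose = torch.tensor([0.5, 0.6, -0.1])  # scores of the other clips in each pair
    print(pairwise_ranking_loss(s_win, s_lose).item())
```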
The experiments are well-structured, with a clear focus on validating the proposed framework through rigorous testing of various SSL backbones. The results indicate a significant improvement over traditional handcrafted features, with SSL models achieving up to 90.8% AUC. The paper provides detailed statistical analyses, including pairwise accuracy and ROC-AUC metrics, which lend credibility to the findings. However, the paper could benefit from a more extensive exploration of the implications of these results in real-world applications.
The paper mentions that the dataset and implementation are publicly available, which is a positive aspect for reproducibility. However, specific details regarding the training setups, hyperparameters, and model architectures could be elaborated further to enhance replicability. The absence of a clear code repository link may hinder some researchers from fully reproducing the results.
The study acknowledges limitations such as demographic imbalances in the evaluator pool and the moderate scale of the dataset. Additionally, the lack of ablation studies on model structure limits the understanding of how different components contribute to performance. Future work should address these limitations to strengthen the findings.
The proposed framework has significant implications for the anime industry and speech generation systems, providing a standardized metric for evaluating anime-like speech. This could streamline the development process for generative models, allowing for more efficient iteration and optimization. Furthermore, the insights gained from the acoustic analysis could inform future research on voice synthesis and style transfer in audio applications.
Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the privacy risks of transmitting biometric audio. While smaller audio-language models enable efficient on-device SEC, their limited capacity often weakens subtle paralinguistic modeling and fine-grained affective grounding. We propose an edge-cloud collaborative framework based on Uncertainty-Guided Speculative Decoding (UGSD). A lightweight edge model drafts captions locally, and only high-uncertainty token blocks are selectively escalated to a stronger cloud verifier for validation. Experiments on the MER2024 benchmark demonstrate substantial BLEU improvements up to 62.7%. UGSD further achieves 1.4x lower latency and 8.5x higher token throughput compared to an edge-only model. These results empirically characterize the quality-efficiency-privacy trade-off in deployable SEC systems.
Primary: unknown
All Institutions: unknown
The paper presents an innovative edge-cloud collaborative framework for Speech Emotion Captioning that effectively addresses computational and privacy challenges. The methodology and experimental results indicate a strong contribution to the field, although improvements in reproducibility and detailed methodological descriptions are needed for broader adoption and validation.
The proposed methodology introduces an edge-cloud collaborative framework that utilizes Uncertainty-Guided Speculative Decoding (UGSD) to enhance Speech Emotion Captioning (SEC). This approach is innovative as it balances the computational load between edge devices and cloud resources, allowing for efficient processing while maintaining privacy. The method's reliance on uncertainty to determine when to escalate processing to the cloud is a notable contribution, as it addresses both efficiency and privacy concerns effectively. However, the paper could benefit from a more detailed description of the UGSD algorithm and its implementation specifics.
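A plausible reading of UGSD, sketched below under assumptions the abstract does not confirm (entropy as the uncertainty measure, a fixed block size, and a simple accept-or-rewrite verifier): the edge model drafts caption tokens in blocks, and only blocks whose average uncertainty exceeds a threshold are sent to the cloud verifier.

```python
import math
from typing import Callable, List

def block_entropy(token_probs: List[List[float]]) -> float:
    """Mean Shannon entropy (nats) of the edge model's per-token distributions in a block."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs]
    return sum(ents) / max(len(ents), 1)

def decode_with_escalation(blocks, probs_per_block, cloud_verify: Callable, threshold: float):
    """Keep low-uncertainty edge-drafted blocks locally; escalate uncertain blocks
    to the cloud verifier for validation or rewriting."""
    output = []
    for block, probs in zip(blocks, probs_per_block):
        if block_entropy(probs) > threshold:
            output.extend(cloud_verify(block))  # escalate uncertain block
        else:
            output.extend(block)                # accept the local draft
    return output

if __name__ == "__main__":
    drafts = [["a", "calm"], ["voice", "???"]]
    probs = [[[0.9, 0.1], [0.8, 0.2]],           # confident block -> kept locally
             [[0.5, 0.5], [0.4, 0.3, 0.3]]]      # uncertain block -> escalated
    print(decode_with_escalation(drafts, probs,
                                 cloud_verify=lambda b: ["voice", "trembling"],
                                 threshold=0.5))
```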
The experiments conducted on the MER2024 benchmark are robust, demonstrating significant improvements in BLEU scores and efficiency metrics such as latency and token throughput. The reported BLEU score improvement of up to 62.7% is impressive and indicates a strong performance of the proposed method compared to traditional edge-only models. However, the paper lacks a comprehensive analysis of the datasets used, including the size, diversity, and how they were annotated, which is crucial for understanding the generalizability of the results.
The paper does not provide sufficient details regarding the implementation of the proposed framework, including hyperparameters, model architectures, or training procedures. This lack of transparency hinders reproducibility, as other researchers may struggle to replicate the results without access to the code or detailed methodological descriptions.
One limitation of the proposed approach is its dependency on the quality of the lightweight edge model. If the edge model's performance is subpar, the overall system may not achieve the desired quality in captioning. Additionally, the reliance on cloud resources, while mitigated by the uncertainty-based approach, still poses potential latency issues in real-world applications where immediate responses are required.
The implications of this research are significant, particularly in the context of deploying SEC systems in privacy-sensitive environments. The proposed framework could facilitate the integration of emotion recognition in various applications, such as virtual assistants, mental health monitoring, and interactive entertainment. By addressing the computational and privacy challenges, this work paves the way for more widespread adoption of SEC technologies.
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we achieve 3.24% WER on Common Voice Swahili, an 82% relative improvement over the baseline. This result surpasses the best previously reported academic system (8.3% WER from XLS-R) by 61% relative improvement. We provide concrete data requirements and a replicable methodology applicable to other low-resource languages.
Primary: Harvard University
All Institutions: Harvard University, Thiomi-Lugha NLP
The paper presents a systematic evaluation of continued pretraining for Swahili ASR, achieving state-of-the-art performance with minimal labeled data. This work is significant in demonstrating the potential of leveraging unlabeled audio in low-resource language settings, providing a replicable methodology that could be applied to other underserved languages.
The methodology presented in the paper is robust and systematic, focusing on the application of continued pretraining (CPT) for low-resource Swahili ASR. The authors clearly outline their three-stage pipeline, which includes training a labeling model, generating pseudo-labels, and performing supervised finetuning. This structured approach is well-justified and effectively demonstrates the potential of CPT in leveraging unlabeled data. The use of a strong baseline model for pseudo-labeling is particularly commendable, as it ensures the quality of the training data.
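Schematically, the three-stage recipe looks like the sketch below; every function here is a placeholder standing in for a full training run, not the authors' actual scripts.

```python
from typing import List, Tuple

def finetune(model: str, data: List[Tuple[str, str]]) -> str:
    """Placeholder for supervised finetuning on (audio, transcript) pairs."""
    return f"{model}+sft({len(data)})"

def continued_pretraining(model: str, data: List[Tuple[str, str]]) -> str:
    """Placeholder for continued pretraining on (audio, pseudo-label) pairs."""
    return f"{model}+cpt({len(data)})"

class Labeler:
    def __init__(self, model: str):
        self.model = model
    def transcribe(self, audio: str) -> str:
        return f"<pseudo-label of {audio}>"

def swahili_cpt_pipeline(labeled, unlabeled_audio, base_model="wav2vec2-bert-2.0"):
    labeler = Labeler(finetune(base_model, labeled))           # Stage 1: train labeling model
    pseudo = [(a, labeler.transcribe(a)) for a in unlabeled_audio]
    cpt_model = continued_pretraining(base_model, pseudo)      # Stage 2: pseudo-labeled CPT
    return finetune(cpt_model, labeled)                        # Stage 3: supervised finetuning

if __name__ == "__main__":
    print(swahili_cpt_pipeline([("a1.wav", "habari")], ["u1.wav", "u2.wav"]))
```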
The experimental evaluation is thorough, with clear comparisons between models trained with and without CPT. The authors provide detailed results across different configurations, showcasing significant improvements in word error rate (WER). The benchmarks established are concrete and relevant, demonstrating a clear advancement over previous systems. The results are well-supported by the experimental design, which isolates the effects of CPT.
The paper provides sufficient details regarding the experimental setup, including hyperparameters, dataset descriptions, and training procedures. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. While the methodology is replicable, the absence of shared resources may hinder broader adoption.
One limitation is the reliance on the quality of the pseudo-labels generated by the baseline model. If the labeling model does not perform well, it could adversely affect the continued pretraining process. Additionally, while the study focuses on Swahili, the generalizability of the findings to other low-resource languages remains to be validated.
The implications of this research are significant, particularly for the over 100 million Swahili speakers who could benefit from improved ASR technology. The methodology could pave the way for advancements in educational technology, accessibility tools, and the preservation of oral traditions in various African languages. The findings underscore the feasibility of developing high-quality ASR systems in low-resource settings, which could have a transformative effect on technology access in these communities.
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University
The main contribution of this work is the development of a novel scoring protocol and augmentation technique that significantly enhances the speaker verification capabilities of speech-aware LLMs. This research addresses a critical gap in the ability of LLMs to process and discriminate speaker identity, which is essential for advancing applications in voice recognition and personalized AI systems.
The paper introduces a model-agnostic scoring protocol for evaluating speaker verification capabilities in speech-aware LLMs, which is a significant advancement in the field. The methodology is sound, utilizing both confidence scoring and log-likelihood ratios to derive continuous verification scores. The augmentation of LLMs with ECAPA-TDNN speaker embeddings through LoRA adapters is innovative, allowing for the integration of speaker verification capabilities while maintaining the natural language interface of LLMs. The approach is well-structured and addresses a critical gap in existing LLM capabilities regarding speaker identity discrimination.
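For open-weight models, the scoring protocol reduces to a log-likelihood ratio between the 'Yes' and 'No' tokens at the answer position of a same-speaker prompt. A minimal sketch follows; the prompt wording and token ids are illustrative, and for API-only models the paper instead uses returned confidence scores.

```python
import torch

def yes_no_verification_score(logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """Continuous verification score from the model's next-token distribution after
    a prompt such as 'Do these two utterances come from the same speaker?':
    the log-likelihood ratio between the 'Yes' and 'No' tokens."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return (log_probs[yes_id] - log_probs[no_id]).item()

if __name__ == "__main__":
    vocab = {"Yes": 0, "No": 1, "Maybe": 2}   # toy vocabulary
    logits = torch.tensor([2.1, 0.4, -1.0])   # toy next-token logits
    print(yes_no_verification_score(logits, vocab["Yes"], vocab["No"]))
```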
The experiments are comprehensive, benchmarking various speech-aware LLMs against the VoxCeleb1 dataset. The results demonstrate the weak performance of off-the-shelf models in speaker discrimination, with EERs exceeding 20%. The introduction of the ECAPA-LLM shows a significant improvement, achieving an EER of 1.03%, which is commendable. The evaluation metrics are appropriate, and the use of multiple trials and datasets enhances the robustness of the findings. However, the paper could benefit from more extensive comparisons with state-of-the-art ASV systems.
The paper provides sufficient details regarding the datasets used, the training procedures, and the evaluation metrics, which facilitates reproducibility. However, the absence of a publicly accessible code repository limits the ability for others to replicate the results fully. Including a GitHub link or similar would enhance reproducibility significantly.
The paper acknowledges limitations in the scoring methods, particularly the coarse nature of confidence-based scoring for closed systems and the high failure rates observed in some models. Additionally, the performance of larger models like Ministral3-3B was unexpectedly poor, suggesting potential issues with embedding space alignment that require further investigation. The reliance on specific datasets may also limit the generalizability of the findings.
The implications of this research are substantial, as it paves the way for integrating speaker verification capabilities into general-purpose LLMs. This could enhance applications in biometric authentication, personalized assistants, and dialogue analysis. The findings suggest a promising direction for future research in multimodal AI systems, where both linguistic and speaker identity information can be processed jointly.
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
Primary: Xiamen University
All Institutions: Xiamen University, Hong Kong SAR, Nanjing University, School of Artificial Intelligence, School of Electronic Science and Engineering, School of Informatics, School of Intelligence Science and Technology, Shenzhen Loop Area Institute, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of AlphaFlowTSE, a novel one-step generative model for target speaker extraction that significantly improves extraction quality and generalization while maintaining low latency. This work represents a meaningful advancement in the field of audio processing, particularly in enhancing the fidelity and efficiency of speaker extraction systems.
The methodology proposed in AlphaFlowTSE is innovative, as it introduces a one-step generative model for target speaker extraction that utilizes a Jacobian-vector product (JVP)-free AlphaFlow objective. The combination of trajectory matching and interval-consistency teacher-student supervision is a significant advancement in the field, addressing the challenges of latency and accuracy in real-world applications. The use of mean-velocity transport in the complex STFT domain is particularly noteworthy, as it aligns training with inference, making the model more efficient.
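The mixture-to-target trajectory can be illustrated with plain conditional flow matching on a straight path from the observed mixture to the clean target, whose velocity is constant along the path. The sketch below shows only that idea; AlphaFlowTSE additionally works in the complex STFT domain, conditions on the enrollment utterance, and adds the JVP-free mean-velocity and interval-consistency teacher-student terms, none of which are modeled here.

```python
import torch

def mixture_to_target_fm_loss(model, mixture: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Flow-matching loss along a straight path from the mixture to the target:
    x_t = (1 - t) * mixture + t * target, whose velocity is (target - mixture)
    everywhere. A one-step generator then reads target_hat = mixture + v_pred."""
    t = torch.rand(mixture.shape[0], 1)          # one time per example
    x_t = (1 - t) * mixture + t * target
    v_target = target - mixture
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()

if __name__ == "__main__":
    toy_model = lambda x, t: torch.zeros_like(x)  # stand-in network
    mix = torch.randn(4, 16)
    tgt = torch.randn(4, 16)
    print(mixture_to_target_fm_loss(toy_model, mix, tgt).item())
```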
The experiments conducted on the Libri2Mix and REAL-T datasets are comprehensive and demonstrate the effectiveness of AlphaFlowTSE. The results show improvements in target-speaker similarity and generalization to real conversational mixtures, with strong performance metrics such as PESQ, ESTOI, and SI-SDR. The ablation studies regarding the MR predictor further validate the robustness of the proposed model.
The paper provides thorough implementation details, including training protocols, model architecture, and evaluation metrics, which contribute to reproducibility. However, the lack of a public repository or demo URL limits the ease of access for other researchers to replicate the results.
One limitation is the reliance on the enrollment utterance, which may not always be available in practical scenarios. Additionally, while the model shows strong performance in controlled environments, its effectiveness in highly variable real-world conditions remains to be fully validated.
The advancements in target speaker extraction have significant implications for applications in personal communication systems, such as virtual assistants and conference call technologies. The ability to accurately extract a target speaker's voice in real-time can enhance user experiences in various audio processing tasks, including automatic speech recognition and noise suppression.
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks. (1) FireRedASR2: an ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR. (2) FireRedVAD: an ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD. (3) FireRedLID: an Encoder-Decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain. (4) FireRedPunc: a BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%). To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
Primary: Super Intelligence Team
All Institutions: Super Intelligence Team
The paper introduces FireRedASR2S, a state-of-the-art all-in-one ASR system that integrates multiple modules to enhance speech recognition accuracy and robustness across various languages and dialects. The technical contributions are substantial, particularly in the context of modular design and comprehensive evaluation, making it a valuable addition to the field of automatic speech recognition.
The paper presents a comprehensive and modular architecture for an automatic speech recognition system that integrates multiple components (ASR, VAD, LID, Punc) into a unified pipeline. The methodology is well-structured, leveraging state-of-the-art techniques such as the Encoder-Decoder architecture for LID and BERT-style models for punctuation prediction. The use of a large and diverse training corpus enhances the model's generalization capabilities across various dialects and languages, which is a significant methodological strength.
The experimental evaluation is robust, with extensive benchmarking against multiple public datasets, demonstrating superior performance across all components. The results are clearly presented, showing improvements in character error rates and other relevant metrics compared to existing systems. The use of human-annotated data for training VAD is particularly noteworthy, as it enhances the reliability of segmentation in real-world applications.
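For readers unfamiliar with the VAD metrics cited above, the snippet below shows how frame-level F1 and AUC-ROC are typically computed with scikit-learn; the labels and posterior scores are synthetic stand-ins rather than FLEURS-VAD-102 data.

```python
# Sketch: frame-level F1 and AUC-ROC for a VAD module, the metrics reported
# for FireRedVAD; frame labels and scores here are synthetic stand-ins.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)                               # 1 = speech frame
scores = np.clip(labels * 0.7 + rng.normal(0.15, 0.2, 10_000), 0, 1)   # model posteriors

preds = (scores >= 0.5).astype(int)                                    # fixed decision threshold
print("frame-level F1 :", f1_score(labels, preds))
print("AUC-ROC        :", roc_auc_score(labels, scores))
```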
The authors have committed to open-sourcing their model weights and code, which is a positive step towards reproducibility. However, the paper could benefit from more detailed descriptions of the training processes and hyperparameter settings to facilitate replication of results by other researchers.
While the system shows impressive performance, it may still face challenges in extremely noisy environments or with highly accented speech that diverges from the training data. Additionally, the reliance on large-scale training data may limit accessibility for smaller institutions or researchers with fewer resources.
The FireRedASR2S system has significant implications for various applications, including real-time transcription services, multilingual communication tools, and accessibility technologies. Its modular design allows for flexible deployment in diverse settings, potentially advancing the state of the art in speech recognition technology.
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives.
Primary: Nanjing University
All Institutions: Central Media Technology Institute, Nanjing University, Shanghai Jiao Tong University, Shenzhen Research Institute of Big Data, ETH Zürich
G-STAR presents a novel approach to timestamped speaker-attributed ASR, significantly advancing the field of audio processing by integrating speaker tracking with LLMs. The methodology and potential applications underscore its relevance and impact in improving speech recognition systems.
The methodology proposed in G-STAR is innovative, combining a time-aware speaker-tracking module with a Speech-LLM transcription backbone. This dual approach addresses the limitations of existing systems that either focus on local diarization or global labeling, thus enhancing the ability to maintain speaker identity consistency across long-form, multi-party speech. The flexibility of supporting both component-wise optimization and joint end-to-end training is a significant strength, allowing for adaptability in various training scenarios. However, the paper could benefit from a more detailed explanation of the cue fusion process and how it integrates with the LLM.
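As an illustration of how temporally grounded speaker cues might be serialized for the Speech-LLM, here is a small sketch; the cue fields and tag format are hypothetical and do not reflect G-STAR's actual interface.

```python
# Sketch of serializing time-grounded speaker cues from a tracker into a
# conditioning prefix for a Speech-LLM; field names are illustrative only.
cues = [
    {"speaker": "spk1", "start": 0.00, "end": 3.20},
    {"speaker": "spk2", "start": 2.80, "end": 6.10},   # overlaps with spk1
]

def cues_to_prompt(cues):
    lines = [f"<{c['speaker']}|{c['start']:.2f}-{c['end']:.2f}>" for c in cues]
    return "Speaker cues: " + " ".join(lines) + "\nTranscribe with speaker labels and timestamps:"

print(cues_to_prompt(cues))
```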
The experiments conducted are comprehensive, analyzing various aspects such as cue fusion, local versus long-context trade-offs, and hierarchical objectives. However, the paper lacks a detailed description of the datasets used, which is crucial for evaluating the robustness and generalizability of the proposed system. The absence of comparisons with state-of-the-art methods also limits the clarity of G-STAR's performance improvements.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Clearer guidelines on how to replicate the experiments, including hyperparameters and training procedures, would enhance the paper's credibility and utility for the research community.
One of the primary limitations is the potential complexity of the system, which may hinder real-time applications. Additionally, the paper does not address how well the system performs in highly noisy environments or with overlapping speech from multiple speakers, which are common challenges in practical scenarios.
The implications of G-STAR are significant, particularly in applications such as virtual meetings, automated transcription services, and assistive technologies for the hearing impaired. By improving speaker attribution and temporal accuracy in transcripts, this research could enhance communication accessibility and efficiency in multi-party interactions.
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location-tied environmental priors that can help reduce this ambiguity. A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, Geo-ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation from 11 semantic context categories. GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones, with audio- and GSC-only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating that geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows that there is no significant difference in performance between models on Geo-ATBench labels and aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. The Geo-AT task, benchmark Geo-ATBench, and reproducible geo-audio fusion framework GeoFusion-AT provide a foundation for studying AT with geospatial semantic context within the CASA community. The dataset, code, and models are available at https://github.com/WuYanru2002/Geo-ATBench.
Primary: University of Oxford
All Institutions: University of Oxford, Xi'an Jiaotong-Liverpool University, KTH Royal Institute of Technology, Ghent University
The paper presents a novel approach to multi-label audio tagging by integrating geospatial semantic context, significantly enhancing the understanding of environmental sounds in complex auditory scenes. The comprehensive methodology and rigorous experimental evaluation contribute to the advancement of the field, establishing a foundation for future research in audio tagging and multimodal learning.
The paper introduces the Geo-AT task, which integrates geospatial semantic context (GSC) with audio for multi-label audio tagging (AT). The methodology is well-structured, presenting a clear framework (GeoFusion-AT) that evaluates various fusion strategies (feature-, representation-, and decision-level). The systematic approach to dataset creation (Geo-ATBench) and the inclusion of human-aligned evaluations enhance the robustness of the proposed methods.
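The sketch below illustrates two of the fusion levels compared in GeoFusion-AT (representation-level concatenation and decision-level logit averaging) with toy tensors; embedding sizes and heads are placeholders, and feature-level fusion inside the audio backbone is not shown.

```python
# Sketch of two fusion levels for combining audio and geospatial-context (GSC)
# features in multi-label audio tagging; shapes and modules are placeholders.
import torch
import torch.nn as nn

audio_emb = torch.randn(8, 768)        # audio-backbone embedding
gsc_emb   = torch.randn(8, 64)         # geospatial-context embedding
n_classes = 28

# representation-level: concatenate embeddings before a shared classifier
rep_head = nn.Linear(768 + 64, n_classes)
rep_logits = rep_head(torch.cat([audio_emb, gsc_emb], dim=-1))

# decision-level: separate heads per modality, average the per-class logits
audio_head, gsc_head = nn.Linear(768, n_classes), nn.Linear(64, n_classes)
dec_logits = 0.5 * (audio_head(audio_emb) + gsc_head(gsc_emb))

print(rep_logits.shape, dec_logits.shape)   # torch.Size([8, 28]) twice
```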
The experiments are comprehensive, utilizing multiple audio backbones and evaluating the performance of models under different fusion strategies. The results demonstrate significant improvements in performance when incorporating GSC, particularly for acoustically similar events. The use of statistical tests to validate improvements adds rigor to the findings.
The authors provide access to the dataset, code, and models, which supports reproducibility. The detailed descriptions of the experimental setup, including data collection and annotation processes, contribute to the transparency of the research.
The study relies on a relatively small sample size for the human evaluation (10 participants), which may limit the generalizability of the findings. Additionally, the dataset is constrained to specific geographic contexts, which might affect the applicability of the results in diverse environments.
The proposed framework and benchmark have the potential to advance research in computational auditory scene analysis (CASA) and multimodal learning by providing a standardized task that incorporates geospatial context. This could lead to improvements in applications such as urban sound monitoring, smart city technologies, and assistive listening devices.
The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than female listeners--a gap that is most pronounced in low-quality speech and gradually diminishes as quality improves. This quality-dependent structure proves difficult to eliminate through simple calibration. We further demonstrate that automated MOS models trained on aggregated labels exhibit predictions skewed toward male standards of perception. To address this, we propose a gender-aware model that learns gender-specific scoring patterns through abstracting binary group embeddings, thereby improving overall and gender-specific prediction accuracy. This study establishes that gender bias in MOS constitutes a systematic, learnable pattern demanding attention in equitable speech evaluation.
Primary: National Institute of Information and Communications Technology
All Institutions: National Institute of Information and Communications Technology, Nagoya University, National Taiwan University
The main contribution of this paper is the identification and analysis of gender bias in MOS ratings, along with the introduction of a gender-aware model that addresses these biases. This work is significant as it not only uncovers a previously underexplored issue in speech quality assessment but also proposes a novel solution that could enhance fairness in automated evaluations.
The paper presents a systematic analysis of gender bias in Mean Opinion Scores (MOS) for speech quality assessment. It employs a novel approach by abstracting binary group embeddings to create a gender-aware model that learns gender-specific scoring patterns. This methodology is well-structured and addresses a significant gap in the literature regarding the biases in human annotations. The use of gender-specific embeddings is innovative and adds depth to the existing methodologies in speech quality assessment.
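A minimal sketch of what a gender-aware MOS head could look like, with a learned binary listener-group embedding concatenated to a shared speech representation; layer sizes and the group encoding are illustrative assumptions rather than the paper's architecture.

```python
# Sketch of a gender-aware MOS predictor: shared speech features plus a learned
# binary listener-group embedding feeding a small regression head.
import torch
import torch.nn as nn

class GenderAwareMOS(nn.Module):
    def __init__(self, feat_dim=256, group_dim=16):
        super().__init__()
        self.group_emb = nn.Embedding(2, group_dim)     # 0 / 1 = listener group
        self.head = nn.Sequential(nn.Linear(feat_dim + group_dim, 128),
                                  nn.ReLU(), nn.Linear(128, 1))

    def forward(self, speech_feat, group_id):
        g = self.group_emb(group_id)
        return self.head(torch.cat([speech_feat, g], dim=-1)).squeeze(-1)

model = GenderAwareMOS()
feats = torch.randn(4, 256)
print(model(feats, torch.tensor([0, 1, 0, 1])).shape)   # torch.Size([4])
```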
The experiments conducted are thorough, revealing a clear disparity in MOS ratings between male and female listeners, particularly in low-quality speech. The authors provide a compelling analysis of how automated MOS models inherit these biases. However, the paper could benefit from more extensive datasets and a broader range of speech quality scenarios to validate the findings further. The results indicate that the proposed model improves prediction accuracy, which is a strong point.
The paper lacks detailed implementation specifics that would allow for full reproducibility of the experiments. While the methodology is sound, the absence of code or a clear description of the experimental setup limits the ability of other researchers to replicate the findings. Including a supplementary material section or a link to a code repository would enhance reproducibility.
One limitation is the focus on gender bias without considering other potential biases (e.g., age, ethnicity) that could also impact MOS ratings. Additionally, the study's reliance on aggregated labels for training automated models may overlook individual listener variability. The authors acknowledge the difficulty in eliminating bias through calibration, suggesting a need for further research in this area.
This research has significant implications for the field of speech quality assessment and machine learning. By highlighting the systematic nature of gender bias in MOS, it calls for more equitable evaluation practices in audio processing systems. The findings could influence future research directions and the development of fairer algorithms in speech technology, ultimately contributing to more inclusive applications.
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong, Applied AI Institute, Central University, Trusted AI Research Center
The main contribution of this paper is the introduction of PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models, which addresses critical gaps in existing methodologies. This work is significant as it provides a systematic approach to robustness verification, crucial for the safe deployment of voice anti-spoofing technologies in real-world applications.
The proposed PV-VASM framework presents a novel approach to verifying the robustness of voice anti-spoofing models through a probabilistic framework. It effectively estimates misclassification probabilities under various transformations, including those from generative models. The methodology is well-structured, utilizing probabilistic concentration inequalities and providing a theoretical upper bound on error probabilities. The model-agnostic nature of PV-VASM enhances its applicability across different voice anti-spoofing models, which is a significant advancement in the field.
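In the spirit of the framework, the following sketch estimates a misclassification probability by Monte Carlo sampling of perturbations and attaches a one-sided Hoeffding upper bound; the toy detector, the Gaussian perturbation, and the choice of bound are assumptions for illustration, not PV-VASM's actual transformations or theoretical result.

```python
# Sketch of probabilistic robustness verification: sample perturbations, estimate
# the misclassification rate, attach a Hoeffding-style upper bound.
import math
import numpy as np

rng = np.random.default_rng(0)

def detector_is_wrong(x):
    """Toy anti-spoofing detector: flags 'spoof' when mean energy exceeds a threshold."""
    return float(np.mean(x ** 2) > 1.2)     # 1.0 if a bona fide sample is misclassified

def verify(x, n=2000, delta=0.05, noise_scale=0.3):
    errs = [detector_is_wrong(x + rng.normal(0, noise_scale, x.shape)) for _ in range(n)]
    p_hat = float(np.mean(errs))
    bound = p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n))   # holds w.p. >= 1 - delta
    return p_hat, bound

x = rng.normal(0, 1.0, 16000)     # 1 s of toy bona fide audio at 16 kHz
print("estimated error prob: %.4f, upper bound: %.4f" % verify(x))
```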
The experimental validation of PV-VASM is comprehensive, covering a variety of transformations and generative models. The authors provide detailed results on the performance of their method against both parametric transformations and synthetic speech generation, showcasing its effectiveness. The use of multiple datasets and the inclusion of real-world conditions strengthen the credibility of the findings.
While the paper includes a thorough description of the methodology and experiments, it lacks specific implementation details or links to code repositories, which may hinder reproducibility. The absence of a demo or project URL further limits the ability for others to validate the findings independently.
The paper acknowledges that the robustness of voice anti-spoofing models significantly varies depending on the type of perturbation and the parameter space. The proposed upper bounds may be overly conservative, leading to less practical applicability in certain scenarios. Additionally, the complexity of verifying robustness against generative models presents challenges that are not fully addressed.
The implications of this research are substantial, particularly in enhancing the security of voice recognition systems against deepfake technologies. The ability to certify the robustness of voice anti-spoofing models has significant applications in various domains, including finance, security, and personal privacy.
Although the deep integration of Automatic Speech Recognition (ASR) systems with Large Language Models (LLMs) has significantly improved accuracy, the deployment of such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unified framework based on LLMs that integrates both non-streaming and streaming speech recognition capabilities. A joint training paradigm enables the system to transition seamlessly between the two recognition modes without any architectural modifications. Furthermore, we introduce a context-aware training paradigm and a co-designed fallback decoding strategy, which enhance streaming recognition accuracy without introducing additional latency. Experimental results demonstrate that Uni-ASR not only achieves competitive performance in non-streaming mode but also remains highly effective in streaming scenarios under diverse latency constraints.
Primary: Tongyi AI Lab
All Institutions: Tongyi AI Lab
The main contribution of this paper is the introduction of Uni-ASR, a unified architecture that effectively integrates non-streaming and streaming ASR capabilities, achieving competitive performance across both modes. This work represents a significant advancement in the field of automatic speech recognition, particularly in addressing the challenges of low-latency applications while maintaining high accuracy.
The methodology presented in Uni-ASR is robust, integrating both non-streaming and streaming capabilities within a single architecture. The joint training paradigm and context-aware training approach are innovative, allowing for seamless transitions between modes without architectural changes. The fallback decoding strategy is particularly noteworthy as it addresses the latency issues inherent in streaming ASR, enhancing performance while maintaining low latency. The use of established architectures like Conformer and LLMs adds credibility to the approach, although the paper could benefit from more detailed descriptions of the training dynamics and hyperparameter settings.
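One common way a single model can serve both modes is by switching the attention mask between full-context and chunk-causal patterns during joint training; the sketch below shows such a mask, with the chunk size and masking policy as assumptions rather than Uni-ASR's exact recipe.

```python
# Sketch of switching between full-context (non-streaming) and chunk-causal
# (streaming) attention masks inside a single model.
import torch

def attention_mask(n_frames: int, streaming: bool, chunk: int = 16) -> torch.Tensor:
    """True = attendable. Full attention offline; chunk-causal when streaming."""
    if not streaming:
        return torch.ones(n_frames, n_frames, dtype=torch.bool)
    idx = torch.arange(n_frames)
    # each frame sees everything up to the end of its own chunk, nothing beyond
    chunk_end = (idx // chunk + 1) * chunk
    return idx.unsqueeze(0) < chunk_end.unsqueeze(1)

print(attention_mask(6, streaming=True, chunk=2).int())
```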
The experimental evaluation is comprehensive, utilizing multiple widely recognized benchmarks such as AISHELL, LibriSpeech, and WeNetSpeech. The results demonstrate competitive performance in both non-streaming and streaming scenarios, outperforming several state-of-the-art models. The ablation studies effectively highlight the contributions of the proposed methodologies, particularly the impact of the latest-token fallback decoding strategy and the context-aware training paradigm on streaming performance. However, the paper could enhance its experimental rigor by including more diverse datasets and languages beyond the Chinese-English bilingual corpus.
The paper provides a reasonable level of detail regarding the architecture and training process, which aids reproducibility. However, the lack of specific URLs for code or datasets limits the ability for others to replicate the study fully. Including a link to a GitHub repository or supplementary materials would significantly improve reproducibility.
While the paper presents a strong framework, it is limited by its focus on a specific bilingual corpus, which may not generalize well to other languages or dialects. Additionally, the reliance on a single model (Qwen3-1.7B) for the decoder may restrict the exploration of how different LLMs could impact performance. The paper also does not sufficiently address potential computational costs associated with the unified architecture during deployment in real-world applications.
The implications of this research are significant, as it addresses a critical need for efficient ASR systems that can operate in real-time environments. The integration of LLMs into ASR frameworks could enhance applications in various fields, including accessibility technologies, real-time translation, and interactive voice response systems. By improving the accuracy and efficiency of ASR, this work could facilitate better communication in multilingual contexts and enhance user experiences in voice-activated technologies.
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio (V2A) generation models, incorporating key adaptations to align generated audio with human preferences. Our approach incorporates three core innovations: (1) AudioScore, a comprehensive human-preference-aligned scoring system for assessing semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference-pair data for DPO optimization; and (3) a curriculum-learning-empowered DPO optimization strategy specifically tailored for flow-based generative models. Experiments on the benchmark VGGSound dataset demonstrate that human-preference-aligned Frieren and MMAudio using V2A-DPO outperform their counterparts optimized with Denoising Diffusion Policy Optimization (DDPO) as well as pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, National Research Council Canada, Shanghai Jiao Tong University, The University of Warwick
The paper presents V2A-DPO, a novel framework for optimizing video-to-audio generation that aligns audio outputs with human preferences through innovative scoring and optimization strategies. This research significantly contributes to the field by addressing critical limitations in existing models and demonstrating state-of-the-art performance through rigorous experimentation.
The proposed V2A-DPO framework innovatively combines Direct Preference Optimization with a comprehensive scoring system (AudioScore) and a curriculum learning strategy, effectively addressing the limitations of existing video-to-audio generation models. The integration of human preference alignment into the optimization process is particularly noteworthy, as it enhances the perceptual quality and aesthetic appeal of generated audio. The methodology is well-structured, with clear definitions of the scoring metrics and a robust pipeline for generating preference pairs, which is crucial for training the models effectively.
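For reference, the standard DPO preference loss on (preferred, rejected) pairs is sketched below; adapting it to flow-based V2A generators, as the paper does, requires replacing the exact log-likelihood terms with flow/diffusion surrogates that are not shown here.

```python
# Sketch of the standard DPO preference loss on (preferred w, rejected l) pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """All inputs: per-sample log-likelihoods of preferred (w) / rejected (l) outputs."""
    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    return -F.logsigmoid(beta * margin).mean()

lw_p, ll_p = torch.tensor([-10.0, -12.0]), torch.tensor([-11.5, -12.5])
lw_r, ll_r = torch.tensor([-10.5, -12.2]), torch.tensor([-11.0, -12.3])
print(dpo_loss(lw_p, ll_p, lw_r, ll_r).item())
```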
The experiments conducted on the VGGSound dataset are thorough and demonstrate the effectiveness of the proposed method against state-of-the-art models. The results show significant improvements in multiple evaluation metrics, validating the authors' claims. The comparison against both DDPO and pre-trained baselines provides a solid foundation for the reported advancements. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component in the proposed framework.
The paper provides sufficient implementation details, including the training setup, hyperparameters, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers. Providing a link to the code would greatly improve the reproducibility of the results.
One limitation is the reliance on a specific dataset (VGGSound), which may limit the generalizability of the findings to other domains or datasets. Additionally, while the AudioScore scoring system is comprehensive, its effectiveness in diverse contexts beyond the training data remains to be validated. The paper also does not address potential biases in the human preference annotations, which could affect the training process.
The advancements in video-to-audio generation have significant implications for multimedia content creation, accessibility, and entertainment industries. By aligning audio generation with human preferences, the proposed framework could enhance user experiences in applications such as film, gaming, and virtual reality. Furthermore, the methodologies developed could inspire future research in multimodal machine learning and preference learning, paving the way for more sophisticated generative models.
Recent advancements in Speech Large Language Models have significantly enhanced multi-dimensional speech understanding. However, the majority of high-performance frameworks are optimized for GPU-centric ecosystems and proprietary backbones, creating a significant gap for deployment on non-CUDA computing infrastructures. In this paper, we present OSUM-Pangu, a fully open-source speech understanding foundation model developed on a completely non-CUDA software and hardware stack. By integrating an audio encoder with the openPangu-7B LLM backbone, we implement the entire training and inference pipeline on the Ascend NPU platform. To facilitate efficient task alignment under non-CUDA resource constraints, we adopt a practical training process that sequentially bridges speech perception and user intent recognition. Experimental results demonstrate that OSUM-Pangu achieves task accuracy comparable to mainstream GPU-based models while maintaining robust natural language interaction capabilities. Our work provides a reproducible, non-CUDA baseline for the open-source speech community, promoting the independent evolution of multimodal intelligence.
Primary: Speech and Language Processing Group
All Institutions: Speech and Language Processing Group
The main contribution of this paper is the development of OSUM-Pangu, an open-source multidimensional speech understanding framework optimized for non-CUDA environments. This work significantly advances the field by providing a viable alternative to GPU-centric models, facilitating the evolution of multimodal intelligence in diverse computing infrastructures.
The methodology presented in OSUM-Pangu is robust, integrating an audio encoder with the openPangu-7B LLM backbone to create a non-CUDA framework for speech understanding. The multi-stage training pipeline is well-structured, allowing for efficient task alignment and robust instruction following. The use of a modality adapter to bridge acoustic and linguistic processing is innovative, although the reliance on fixed task tags in the initial training stages may limit flexibility.
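A minimal sketch of a modality adapter of the kind described, stacking consecutive encoder frames and projecting them into the LLM embedding width; the dimensions and stacking factor are placeholders, not OSUM-Pangu's configuration.

```python
# Sketch of a modality adapter bridging an audio encoder and an LLM backbone:
# temporal downsampling by frame stacking plus projection to the LLM width.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    def __init__(self, enc_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(nn.Linear(enc_dim * stack, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, enc_out):                     # (batch, frames, enc_dim)
        b, t, d = enc_out.shape
        t = t - t % self.stack                      # drop the ragged tail
        x = enc_out[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)                         # (batch, frames/stack, llm_dim)

adapter = ModalityAdapter()
print(adapter(torch.randn(2, 99, 1280)).shape)      # torch.Size([2, 24, 4096])
```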
The experiments are comprehensive, utilizing a variety of datasets and demonstrating competitive performance against mainstream GPU-based models. The Instruction Following Rate (IFR) metric is a valuable addition, providing insights into the model's ability to interpret natural language instructions. However, the paper could benefit from more detailed comparisons with existing models, particularly in terms of real-world applicability.
The implementation details are adequately described, and the use of open-source components enhances reproducibility. However, the lack of a publicly available demo or clear instructions for reproducing the experiments may hinder broader adoption.
One limitation is the potential dependency on the Ascend NPU infrastructure, which may not be accessible to all researchers. Additionally, while the model shows strong performance, it may not generalize well to all speech understanding tasks, particularly those requiring extensive multimodal training.
OSUM-Pangu has significant implications for the development of speech understanding systems in non-CUDA environments, promoting accessibility and encouraging further research in open-source frameworks. Its approach may inspire future work in multimodal AI, particularly in settings where GPU resources are limited.
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
Primary: Central University
All Institutions: Central University, Applied AI Institute, City University of Hong Kong, Trusted AI Research Center
The paper presents a robust framework for verifying the resilience of voice anti-spoofing models against emerging generative threats. Its comprehensive methodology and experimental validation position it as a valuable contribution to the field, although improvements in reproducibility and clarity would enhance its impact.
The proposed PV-VASM framework introduces a novel probabilistic approach to verify the robustness of voice anti-spoofing models against various transformations, including text-to-speech and voice cloning. The methodology is model-agnostic and leverages probabilistic concentration inequalities to derive a theoretical upper bound on misclassification probabilities. The authors provide a detailed mathematical formulation, which is commendable, but the complexity may hinder understanding for practitioners. The approach is well-grounded in existing literature, addressing a significant gap in the certification of voice anti-spoofing systems.
The experimental validation is extensive, covering a wide range of transformations and generative models. The authors utilize a combination of datasets, including ASVspoof and other open-source collections, to evaluate the robustness of their framework. The results demonstrate meaningful robustness certificates and highlight the framework's applicability in real-world scenarios. However, the paper could benefit from clearer visualizations of results and comparisons with existing methods to better illustrate the advantages of PV-VASM.
While the paper provides a comprehensive description of the methodology and experimental setup, it lacks specific implementation details and code availability, which are crucial for reproducibility. The absence of a project URL further complicates efforts to replicate the findings. Providing a GitHub repository or supplementary materials would greatly enhance the reproducibility of the results.
The paper acknowledges certain limitations, such as the potential over-conservativeness of the upper bounds and the varying performance of the framework against different types of perturbations. Additionally, the complexity of the methodology may pose challenges for practical implementation in real-world applications. The authors also note that the robustness against generative models is less effective, indicating a need for further refinement.
The proposed framework has significant implications for the field of audio security, particularly in enhancing the robustness of voice anti-spoofing systems. As generative models become increasingly sophisticated, the ability to certify the robustness of these systems is crucial for preventing misuse. The research contributes to the ongoing discourse on security in machine learning and could influence future developments in anti-spoofing technologies.
Modern generative audio models can be misused by adversaries, in particular to impersonate other people and gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods have started to evolve. Unfortunately, current SDD methods generally suffer from a lack of generalization to new audio domains and generators. Moreover, they lack interpretability, especially human-like reasoning that would naturally explain the attribution of a given audio sample to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from a newly proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for its predictions.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong, Applied AI Institute, Fusion Brain Lab, Trusted AI Research Center
The paper presents a novel approach to speech deepfake detection by integrating human-inspired reasoning with LALMs, addressing both detection performance and interpretability. This contribution is particularly relevant in the context of increasing concerns over audio deepfakes and their potential misuse, making the research timely and impactful in the field of machine learning and audio processing.
The proposed HIR-SDD framework innovatively integrates Large Audio Language Models (LALMs) with human-inspired reasoning through a novel dataset of human-annotated reasoning traces. This combination aims to enhance both the detection capabilities and interpretability of speech deepfake detection systems. The methodology is well-structured, utilizing hard-label and chain-of-thought (CoT) pipelines, and incorporates reinforcement learning to improve reasoning quality. However, while the approach is sound, the reliance on LALMs may introduce challenges related to generalization and robustness, particularly against high-fidelity deepfake audio that was not present in the training data.
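To illustrate the difference between the two supervision pipelines, the toy example below contrasts a hard-label target with a chain-of-thought target; the field names and the cue wording are invented for illustration and are not taken from the paper's annotated dataset.

```python
# Toy contrast between a hard-label target and a chain-of-thought (CoT) target
# for speech deepfake detection; field names and cues are illustrative only.
hard_label_target = {"label": "spoof"}

cot_target = {
    "reasoning": (
        "Prosody is flat across phrase boundaries; breath noises are absent; "
        "sibilants show a metallic ringing typical of neural vocoders."
    ),
    "label": "spoof",
}

for name, tgt in [("hard-label", hard_label_target), ("chain-of-thought", cot_target)]:
    print(name, "->", tgt)
```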
The experimental evaluation is thorough, demonstrating the effectiveness of the HIR-SDD framework against conventional models like Wav2Vec2-AASIST. The paper provides detailed metrics such as accuracy, balanced accuracy, and F1 scores, and compares the performance of different training strategies. However, the results indicate that while the proposed model shows competitive performance, it still struggles with modern high-fidelity synthesis systems. The evaluation of reasoning quality through external models like GPT-5.1 adds credibility, but the overall improvement in reasoning quality remains modest.
The paper outlines the methodology and datasets used, but lacks sufficient detail on the implementation of the models and training procedures to ensure full reproducibility. While it mentions the use of specific models and training parameters, the absence of a publicly available code repository or demo limits the ability for others to replicate the results. The mention of future work to refine evaluation and stability suggests ongoing developments that may not yet be fully realized.
The primary limitations include the model's struggle with generalization to unseen high-fidelity deepfake audio, which is a critical aspect for practical applications. Additionally, the reasoning quality, while improved, does not show significant enhancements over traditional methods, indicating potential areas for further research. The dataset, while novel, may also be limited in scope, as it primarily focuses on English and Russian audio, which could affect the model's applicability to other languages or dialects.
The implications of this research are significant, particularly in areas where audio authenticity is critical, such as security and biometrics. By improving the interpretability of deepfake detection systems, the proposed framework could enhance trust in automated systems that rely on audio verification. The integration of human-inspired reasoning may also pave the way for more transparent AI systems in various domains, fostering greater public confidence in AI technologies.
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. Our model employs a post-encoder separator with serialized CTC to produce talker-ordered transcripts, and leverages an adapted LLM-based SOT objective as a multi-talker-aware teacher signal to explicitly regularize mixed-speech representations. To further support variable numbers of talkers, we introduce a Talker-Count Head that predicts the talker count and dynamically selects the appropriate decoding branch. Experiments on LibriMix show that the proposed encoder-only model achieves performance comparable to LLM-based systems in the two-talker condition, while delivering significant improvements in the three-talker condition at a much smaller real-time factor (RTF).
Primary: SB Intuitions
All Institutions: SB Intuitions
This paper presents a novel encoder-only framework for multi-talker ASR that distills semantic guidance from LLMs during training, achieving competitive performance while maintaining inference efficiency. The integration of a Talker-Count Head and the focus on serialized CTC decoding represent meaningful advancements in the field, addressing key challenges in multi-talker speech recognition.
The proposed methodology effectively shifts the role of large language models (LLMs) from a computationally expensive decoder to a teacher for training an encoder-only multi-talker ASR system. The integration of a Talker-Count Head (TCH) to dynamically adapt to varying numbers of talkers is a notable innovation that addresses a common limitation in existing systems. The use of serialized CTC for efficient inference while maintaining semantic guidance through distillation is well-conceived, though the paper could benefit from clearer descriptions of the training process and hyperparameter choices.
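The following sketch shows the idea of a Talker-Count Head that pools encoder features, predicts the number of talkers, and routes decoding accordingly; the pooling choice, layer sizes, and routing policy are illustrative assumptions.

```python
# Sketch of a Talker-Count Head: pool encoder features, classify the talker
# count, and use the prediction to select the matching decoding branch.
import torch
import torch.nn as nn

class TalkerCountHead(nn.Module):
    def __init__(self, dim=512, max_talkers=3):
        super().__init__()
        self.cls = nn.Linear(dim, max_talkers)      # classes: 1, 2, ..., max_talkers

    def forward(self, enc_out):                     # (batch, frames, dim)
        pooled = enc_out.mean(dim=1)                # temporal average pooling
        return self.cls(pooled)

enc_out = torch.randn(2, 200, 512)
tch = TalkerCountHead()
n_talkers = tch(enc_out).argmax(dim=-1) + 1         # predicted talker count per utterance
print(n_talkers)   # e.g. tensor([2, 1]) -> route to the 2-talker / 1-talker CTC branch
```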
The experiments conducted on the LibriMix dataset provide a solid foundation for evaluating the proposed model's performance. The results demonstrate that the encoder-only model achieves comparable performance to LLM-based systems in two-talker scenarios and outperforms them in three-talker conditions, showcasing the effectiveness of the proposed approach. However, the paper lacks detailed comparisons with more recent state-of-the-art methods, which could strengthen its claims.
While the paper outlines the model architecture and training phases, it does not provide sufficient implementation details or code availability, which may hinder reproducibility. Including a link to a code repository or supplementary materials would enhance the paper's reproducibility.
The reliance on a fixed number of encoder layers and the challenges associated with talker-count accuracy in three-talker conditions are notable limitations. Additionally, the paper does not address potential performance degradation in more complex or noisy environments beyond those tested.
The proposed framework has significant implications for real-world applications in automatic speech recognition, particularly in environments with overlapping speech. By improving the efficiency and accuracy of multi-talker ASR systems, this research could enhance communication technologies, accessibility tools, and voice-activated systems in various domains.
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
Primary: unknown
All Institutions: unknown
MoXaRt presents a novel approach to audio-visual sound interaction in XR, significantly improving speech intelligibility and cognitive load management. The integration of visual cues with audio processing represents a meaningful advancement in the field, although further work is needed to enhance reproducibility and explore broader applications.
The methodology presented in MoXaRt is innovative, utilizing a cascaded architecture that combines audio-only separation with visual detection to enhance sound interaction in XR environments. The dual approach of coarse separation followed by refinement using visual cues is a novel integration that addresses the challenges of complex acoustic environments. However, the paper could benefit from a more detailed description of the algorithms used in the cascaded architecture and the specific techniques for visual detection and refinement.
The evaluation is robust, featuring a new dataset specifically designed for the study, which includes 30 one-minute recordings of concurrent speech and music. The user study with 22 participants provides empirical evidence of the system's effectiveness, demonstrating significant improvements in speech intelligibility and cognitive load reduction. The statistical significance of the results (p < 0.01 and p < 0.001) adds credibility to the findings, although further details on participant demographics and experimental controls would strengthen the evaluation.
The paper lacks sufficient implementation details that would allow for full reproducibility of the results. While the architecture is described, specific parameters, training procedures, and the dataset's characteristics are not thoroughly documented. Providing a code repository or supplementary materials would enhance reproducibility.
One limitation is the relatively small dataset size, which may affect the generalizability of the findings. Additionally, the system's performance in more diverse acoustic environments or with different types of sound sources has not been explored. The processing latency of ~2 seconds may also be a concern for real-time applications in XR.
The implications of MoXaRt are significant for the fields of XR and audio processing, as it addresses a critical challenge in creating immersive and socially engaging experiences. By improving speech intelligibility in complex environments, this research could enhance communication in various applications, from virtual meetings to gaming, thereby influencing user experience and interaction design in XR.
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
Primary: Tel Aviv University
All Institutions: Tel Aviv University
The main contribution of this paper is the introduction of ID-LoRA, a unified framework for audio-video personalization that allows for joint generation of visual and auditory content based on text prompts, significantly advancing the capabilities of generative models in the multimedia domain. The technical contributions, particularly in methodology and experimental validation, position this work as a notable advancement in the field of machine learning and generative media.
The proposed ID-LoRA framework innovatively integrates audio and video generation through a unified model, addressing the limitations of existing cascaded approaches. The introduction of negative temporal positions and identity guidance are significant methodological advancements that enhance the model's ability to preserve speaker identity while allowing for flexible text-based control over both audio and visual outputs. The use of a joint audio-video diffusion backbone (LTX-2) is a strong choice, leveraging the latest advancements in diffusion models to achieve high-quality generative results.
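Identity guidance, as described, follows the classifier-free-guidance pattern of contrasting predictions with and without the reference conditioning; the sketch below shows that pattern with a toy denoiser, and the call signature and guidance scale are assumptions rather than ID-LoRA's exact formulation.

```python
# Sketch of identity guidance as a classifier-free-guidance variant: contrast
# denoiser predictions with and without the reference (identity) conditioning.
import torch

def identity_guided_pred(denoiser, x_t, t, text_cond, ref_cond, scale=3.0):
    pred_no_ref = denoiser(x_t, t, text_cond, ref=None)
    pred_ref = denoiser(x_t, t, text_cond, ref=ref_cond)
    # amplify reference-specific (speaker / appearance) features
    return pred_no_ref + scale * (pred_ref - pred_no_ref)

# toy denoiser so the sketch runs end to end
denoiser = lambda x, t, c, ref: x * 0.9 + (0.05 if ref is not None else 0.0)
x_t = torch.randn(1, 8)
print(identity_guided_pred(denoiser, x_t, t=0.5, text_cond=None, ref_cond="ref"))
```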
The experiments are robust, featuring a comprehensive evaluation protocol that includes both automatic metrics and human preference studies. The paper demonstrates significant improvements over state-of-the-art models, including commercial solutions, in speaker similarity and lip synchronization. The use of multiple datasets (CelebV-HQ and TalkVid) and the careful construction of evaluation splits (easy and hard) provide a thorough assessment of the model's performance across different conditions. The human evaluation metrics, including A/B preference tests and Mean Opinion Scores, add credibility to the findings.
The paper provides detailed implementation information, including training parameters, dataset preprocessing, and evaluation metrics, which enhances reproducibility. However, the reliance on specific datasets and the mention of proprietary models (like Kling 2.6 Pro) could pose challenges for complete reproducibility in diverse contexts.
While the model shows strong performance, it may still struggle with extreme variations in acoustic environments or highly dynamic visual scenes that were not extensively tested. The model's reliance on a relatively small training dataset (~3K pairs) raises questions about its generalization capabilities in broader applications. Additionally, the ethical implications of generating realistic audio-visual content without consent are not fully addressed.
The potential applications of ID-LoRA are significant, including personalized content creation, multilingual dubbing, and accessibility tools. However, the technology also poses risks related to misuse, such as deepfakes and unauthorized impersonation. The authors acknowledge these risks and suggest mitigations, emphasizing the need for ethical considerations in deployment.
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
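The core move, decomposing a caption into Atomic Perceptual Units and verifying each against the audio, can be sketched as follows; the unit extraction, the audio-grounded `verify` judge, and scoring by the fraction of supported units are assumptions standing in for EmoSURA's actual pipeline and aggregation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AtomicUnit:
    text: str  # a self-contained statement about a vocal or emotional attribute

def atomic_verification_score(caption_units: List[AtomicUnit],
                              verify: Callable[[AtomicUnit], bool]) -> float:
    """Hedged sketch of atomic verification: the caption has already been
    split into Atomic Perceptual Units (e.g. by an LLM prompt), and
    `verify` is an audio-grounded judge that checks one unit against the
    raw speech. Fraction-of-supported-units scoring is an assumption."""
    if not caption_units:
        return 0.0
    supported = sum(1 for unit in caption_units if verify(unit))
    return supported / len(caption_units)

# Toy usage with a stand-in verifier in place of the audio-grounded model.
units = [AtomicUnit("the voice is breathy"), AtomicUnit("the speaker sounds joyful")]
print(atomic_verification_score(units, verify=lambda u: "breathy" in u.text))  # 0.5
```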
Primary: TUM University Hospital
All Institutions: TUM University Hospital, Imperial College London, Munich Center for Machine Learning
The paper presents EmoSURA, a novel evaluation framework for emotional speech captions that significantly improves upon traditional methods by focusing on atomic verification and introducing a new benchmark for evaluation. This work is poised to advance the state of the art in emotional speech processing and evaluation methodologies.
The proposed EmoSURA framework introduces a novel approach to evaluating emotional speech captions by breaking down complex captions into Atomic Perceptual Units. This decomposition allows for a more granular analysis of emotional and vocal attributes, which is a significant improvement over traditional evaluation metrics that often fail to capture the nuances of long-form captions. The audio-grounded verification mechanism adds robustness to the evaluation process, ensuring that the assessments are closely tied to the actual audio content. The methodology is well-structured, with clear definitions and rules for the extraction and verification processes.
The experiments conducted demonstrate a positive correlation between EmoSURA's assessments and human judgments, which is a critical validation of the framework's effectiveness. The introduction of SURABench as a benchmark for evaluation is commendable, as it addresses the scarcity of standardized resources in this domain. However, the paper could benefit from more extensive quantitative results and comparisons with existing evaluation metrics to further substantiate the claims.
The paper lacks detailed implementation specifics, such as the algorithms used for audio verification and the criteria for selecting the datasets. Providing access to code or a clear description of the experimental setup would enhance reproducibility and allow other researchers to validate the findings.
One limitation is the reliance on human judgment for correlation validation, which can be subjective. Additionally, while the framework shows promise, its effectiveness across diverse languages and emotional contexts has not been explored, which could limit its applicability.
EmoSURA has the potential to significantly impact the fields of speech processing and emotional AI by providing a more reliable evaluation framework for emotional speech captions. This could enhance applications in areas such as virtual assistants, therapy bots, and any system that relies on understanding emotional nuances in speech.
Recent advances in zero-shot voice conversion have shown potential for emotion control, yet performance remains suboptimal or inconsistent because of limited expressive capacity. We propose an Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. We significantly improve emotion conversion performance, doubling the baseline Emotion Conversion Accuracy (ECA) from 42.40% to 85.50% while maintaining linguistic integrity and speech quality, without compromising speaker identity. Our ablation study suggests that joint control of both sequence modulation and acoustic realization is essential to synthesize distinct emotions. Furthermore, comparative analysis verifies the generalizability of the proposed method and provides insights into the role of acoustic decoupling in maintaining speaker identity.
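The prefix mechanism is only described at a high level here; a minimal prefix-conditioning sketch is given below. The prefix length, dimensionality, emotion inventory, and the choice to prepend learned embeddings to the input sequence (rather than injecting per-layer key/value prefixes, as deep-prefix prompting typically does) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class EmotionPrefix(nn.Module):
    """Hedged sketch of an emotion-aware prefix: a table of learned prefix
    embeddings, one block per emotion, prepended to the sequence fed into
    the conversion backbone. All sizes and the prepend-to-input design are
    illustrative assumptions."""
    def __init__(self, num_emotions=5, prefix_len=8, dim=256):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_emotions, prefix_len, dim) * 0.02)

    def forward(self, seq, emotion_id):
        # seq: (batch, time, dim); emotion_id: (batch,) integer labels
        return torch.cat([self.prefix[emotion_id], seq], dim=1)

prefix = EmotionPrefix()
out = prefix(torch.randn(2, 100, 256), torch.tensor([0, 3]))  # (2, 108, 256)
```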
Primary: The University of Texas at Dallas
All Institutions: The University of Texas at Dallas
The paper introduces the Emotion-Aware Prefix, significantly enhancing explicit emotion control in voice conversion models. This work presents a well-structured methodology with substantial technical contributions, demonstrating a clear impact on the field of audio processing and voice synthesis.
The proposed Emotion-Aware Prefix introduces a novel approach to emotion control in voice conversion models, utilizing a two-stage backbone that enhances the expressiveness of synthesized speech. The methodology is well-structured, leveraging Deep-Prefix Prompting to achieve significant improvements in Emotion Conversion Accuracy (ECA). The paper provides a clear framework for joint control of sequence modulation and acoustic realization, which is a critical advancement in the field. The methodology is sound and addresses a pertinent challenge in voice conversion.
The experiments are robust, demonstrating a clear improvement in ECA from 42.40% to 85.50%. The use of ablation studies to analyze the contributions of different components of the model adds depth to the evaluation. Comparative analysis further strengthens the findings by showcasing the generalizability of the proposed method. However, the paper could benefit from more detailed descriptions of the datasets used and the specific metrics applied during evaluation.
The paper mentions the use of Gen AI tools for grammar and wording corrections, but it lacks detailed implementation specifics that would facilitate reproducibility. There is no mention of code availability or specific datasets, which are critical for other researchers to replicate the results.
While the paper presents significant improvements, it does not address potential limitations such as the model's performance across diverse languages or accents. Additionally, the reliance on a two-stage model may introduce complexity that could hinder real-time applications. The paper could also explore the trade-offs between emotion expressiveness and speaker identity preservation more thoroughly.
The advancements in emotion-aware voice conversion have significant implications for various applications, including virtual assistants, gaming, and therapeutic tools. By improving emotional expressiveness in synthesized speech, this work could enhance user experience and engagement in interactive systems. The insights on acoustic decoupling could also inform future research in maintaining speaker identity while allowing for emotional variability.
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought raises the gain further, to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
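Audio-Permutational Self-Consistency, as described, amounts to querying the model over several orderings of the same audio candidates and aggregating the answers; a minimal sketch is below. The `answer_fn` stand-in for an LALM call, the cap on the number of orderings, and plain majority voting are assumptions about the exact aggregation.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List, Sequence

def apsc_predict(answer_fn: Callable[[Sequence[str], str], str],
                 audio_clips: List[str],
                 question: str,
                 max_orders: int = 6) -> str:
    """Hedged sketch of Audio-Permutational Self-Consistency: ask the same
    question over several orderings of the audio candidates and take a
    majority vote over the answers."""
    votes = Counter()
    for i, order in enumerate(permutations(audio_clips)):
        if i >= max_orders:
            break
        votes[answer_fn(order, question)] += 1
    return votes.most_common(1)[0][0]

# Toy usage: a stand-in "model" that answers with whichever clip it hears last.
clips = ["dog_bark.wav", "rain.wav", "violin.wav"]
print(apsc_predict(lambda order, q: order[-1], clips, "Which clip contains music?"))
```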
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper presents MUGEN, a comprehensive benchmark for evaluating multi-audio understanding in LALMs, revealing critical blind spots in current models and proposing innovative strategies for improvement. The work is significant as it addresses a largely underexplored area in audio processing, providing a foundation for future advancements in the field.
The paper introduces MUGEN, a novel benchmark for multi-audio understanding in large audio-language models (LALMs). The methodology is robust, employing a comprehensive task design that emphasizes cross-audio comparison and integrates various auditory dimensions. The use of training-free strategies like Audio-Permutational Self-Consistency (APSC) is innovative, showcasing a thoughtful approach to overcoming positional biases in audio processing. The combination of self-consistency and permutation strategies represents a significant methodological advancement in the field.
The experimental setup is thorough, benchmarking multiple advanced LALMs across diverse tasks. The results clearly demonstrate the limitations of current models in multi-audio scenarios, particularly in non-semantic dimensions. The systematic evaluation of performance degradation with increasing audio inputs is particularly insightful, revealing critical challenges in scaling multi-audio understanding. The use of a variety of models, including proprietary and open-source, adds depth to the comparative analysis.
The paper provides sufficient detail regarding the experimental setup, including model configurations and evaluation metrics. However, the reliance on proprietary models may limit reproducibility for some aspects of the study. The availability of the MUGEN benchmark on Hugging Face enhances reproducibility for future research.
The paper acknowledges that while the proposed benchmark reveals significant weaknesses in current LALMs, it does not address potential solutions for improving non-semantic reasoning capabilities. Additionally, the focus on training-free strategies may overlook the potential benefits of fine-tuning models for enhanced performance.
The introduction of MUGEN has the potential to significantly advance research in multi-audio understanding, paving the way for improved applications in real-world audio-centric tasks. By highlighting the limitations of existing models, the paper encourages further exploration and development of more robust audio-language systems, which could have implications in areas like speech analytics, audio retrieval, and interactive voice agents.
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Visual-Language (VL) backbones with speech understanding and generation capabilities, while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is equipped with two lightweight, trainable plug-and-play modules, a speech projector and a speech token generator, while keeping the VL backbone fully frozen. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves excellent spoken QA performance, which is comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.
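The training recipe, a fully frozen VL backbone plus lightweight trainable modules, can be sketched as follows; the projector architecture, the stand-in backbone, and all dimensions are assumptions rather than Speech-Omni-Lite's actual configuration.

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Hypothetical lightweight projector mapping speech-encoder features
    into the frozen VL backbone's embedding space; all sizes are assumptions."""
    def __init__(self, speech_dim=512, llm_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, llm_dim),
                                 nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, speech_feats):          # (batch, frames, speech_dim)
        return self.net(speech_feats)         # (batch, frames, llm_dim)

# A stand-in for the pre-trained VL backbone, kept fully frozen.
vl_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True), num_layers=2)
for p in vl_backbone.parameters():
    p.requires_grad_(False)

# Only the plug-and-play modules (here, the projector) receive gradients.
projector = SpeechProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

speech_tokens = projector(torch.randn(2, 50, 512))
hidden = vl_backbone(speech_tokens)           # frozen backbone consumes speech tokens
```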
Primary: Huawei Leibniz Research Center
All Institutions: Huawei Leibniz Research Center, The Hong Kong Polytechnic University, Harbin Institute of Technology, Shenzhen, Hong Kong University of Science and Technology, Shenzhen Loop Area Institute
This paper presents a cost-efficient framework for integrating speech capabilities into vision-language models, significantly advancing multimodal machine learning by reducing resource requirements and enhancing accessibility. The technical contributions, particularly in methodology and experimental validation, position this work as a meaningful advancement in the field.
The methodology is innovative, introducing a cost-efficient framework that utilizes lightweight, trainable modules to extend pre-trained vision-language models with speech capabilities. The approach of keeping the backbone frozen while adding modular components is a significant advancement, allowing for efficient training and transferability across different backbones. The use of a novel data construction strategy (QTATS) to mitigate data scarcity is particularly noteworthy, as it leverages existing resources effectively.
The experimental results demonstrate strong performance in spoken QA tasks, achieving results comparable to larger models trained on extensive datasets. The evaluation metrics used, including WER and CER, provide a solid foundation for assessing the model's capabilities. However, the paper could benefit from more extensive ablation studies to further validate the contributions of individual components.
The paper provides detailed descriptions of the model architecture and training procedures, which enhances reproducibility. However, the absence of a public code repository limits the ability for other researchers to replicate the results fully. Including a demo or project URL would have further supported reproducibility efforts.
One limitation is the reliance on synthetic data for training the speech token generator, which may introduce biases or inconsistencies compared to real-world data. Additionally, the performance on ASR tasks is noted to be lower than that of larger models, indicating potential areas for improvement. The paper also does not address the computational requirements for the training of the lightweight modules, which could be a concern for smaller institutions.
The work has significant implications for democratizing access to advanced multimodal AI systems, as it reduces the computational burden and data requirements typically associated with training omni-modal models. This could lead to broader applications in accessibility technologies, particularly for individuals with disabilities who rely on speech interfaces. The framework also raises ethical considerations regarding the deployment of generative models, particularly in ensuring safety and alignment with societal values.
Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achieve studio-level quality while retaining its low-hallucination property. First, we show that finetuning PASE with dry targets rather than targets containing simulated early reflections substantially improves dereverberation. Second, to address performance limitations under strong additive noise, we replace the GAN-based generative module in PASE with a flow-matching module, enabling studio-quality generation even under highly challenging conditions. Experiments demonstrate that StuPASE consistently produces perceptually high-quality speech while maintaining low hallucination, outperforming state-of-the-art SE methods. Audio demos are available at: https://xiaobin-rong.github.io/stupase_demo/.
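The flow-matching module that replaces the GAN-based generator is not specified here; a generic conditional flow-matching training objective is sketched below under standard rectified-flow assumptions (straight interpolation path, velocity target x1 - x0). The conditioning interface and tensor shapes are illustrative, not StuPASE's actual design.

```python
import torch

def flow_matching_loss(velocity_net, x_clean, cond):
    """Hedged sketch of a conditional flow-matching objective, standing in
    for the module that replaces PASE's GAN-based generator."""
    x0 = torch.randn_like(x_clean)                      # noise endpoint
    t = torch.rand(x_clean.shape[0], *([1] * (x_clean.dim() - 1)))
    x_t = (1 - t) * x0 + t * x_clean                    # point on the straight path
    v_target = x_clean - x0                             # constant velocity along it
    v_pred = velocity_net(x_t, t.flatten(), cond)
    return torch.mean((v_pred - v_target) ** 2)

# Toy usage with a stand-in velocity network over (batch, mel_bins, frames) targets.
toy_net = lambda x, t, cond: torch.zeros_like(x)
loss = flow_matching_loss(toy_net, torch.randn(4, 80, 100), cond=None)
```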
Primary: Nanjing University
All Institutions: Nanjing University, Collaboration AI, Key Laboratory of Modern Acoustics, NJU-Horizon Intelligent Audio Lab
The main contribution of this paper is the introduction of StuPASE, a generative speech enhancement framework that achieves studio-quality output while maintaining low hallucination through innovative methodologies. The technical contributions, particularly the integration of flow-matching and dry-target finetuning, represent a meaningful advancement in the field of generative audio processing, addressing critical challenges in speech enhancement.
The paper introduces a novel approach to generative speech enhancement (SE) by building on the existing PASE framework. The authors effectively leverage dry-target finetuning to improve dereverberation and replace the GAN-based generative module with a flow-matching module, which enhances the quality of generated speech under adverse conditions. The methodology is well-structured, with clear explanations of the components involved, including the semantic enhancement module and the acoustic enhancement module. The use of flow-matching for high-fidelity generation is a significant advancement over traditional GAN approaches, which are known to struggle with artifacts and noise suppression.
The experiments are comprehensive, utilizing a substantial dataset of around 2,000 hours of clean speech and various noise types. The evaluation metrics are robust, encompassing both objective measures (DNSMOS, UTMOS, etc.) and subjective assessments (Q-MOS, S-MOS). The results demonstrate that StuPASE consistently outperforms state-of-the-art methods, showcasing significant improvements in perceptual quality and linguistic integrity. The ablation studies further validate the contributions of dry-target finetuning and the flow-matching module, reinforcing the effectiveness of the proposed approach.
The paper provides detailed implementation details, including training protocols, dataset descriptions, and evaluation metrics. However, the lack of a publicly available code repository limits reproducibility. While the methodology is described thoroughly, access to the actual implementation would enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets, which may not generalize across all speech enhancement scenarios. Additionally, while the paper demonstrates improvements in perceptual quality, it does not extensively explore the computational efficiency of the new model compared to its predecessors. The subjective evaluation is based on a limited number of samples, which may not capture the full variability of real-world applications.
The advancements presented in this paper have significant implications for applications in speech enhancement, particularly in environments with challenging acoustic conditions. The ability to produce high-quality speech with low hallucination can benefit various fields, including telecommunications, assistive technologies, and media production. The findings could pave the way for further research into generative models in audio processing, potentially influencing future developments in AI-driven audio technologies.
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors a foundational Audio Large Language Model for multidimensional reasoning and for the detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to substantially improve the accuracy of descriptions and the temporal localization of quality issues. With this approach we reach a state-of-the-art mean PCC of 0.71 on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model's ability to pinpoint and classify audio artifacts in time.
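GRPO's group-relative advantage computation is standard; a minimal sketch with dimension-specific rewards is given below. Summing rewards across dimensions with equal weight, and the toy dimension names, are assumptions about how the paper combines its per-dimension rewards.

```python
import numpy as np

def grpo_advantages(rewards_per_dim):
    """Hedged sketch of group-relative advantages: each of G sampled
    responses receives one reward per perceptual dimension; rewards are
    summed per response and standardized within the group, as in GRPO."""
    rewards = np.asarray(rewards_per_dim, dtype=float)   # shape (G, D)
    totals = rewards.sum(axis=1)                         # one scalar per response
    return (totals - totals.mean()) / (totals.std() + 1e-8)

# Toy group of 4 responses scored on 3 hypothetical dimensions
# (e.g. noisiness, coloration, discontinuity):
print(grpo_advantages([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0]]))
```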
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a novel Calibration-Reasoning framework that significantly enhances the interpretability and accuracy of speech quality assessment through a two-stage methodology. This work represents a meaningful advancement in the integration of audio processing and large language models, addressing critical gaps in existing approaches and setting a new standard for future research in the domain.
The paper introduces a two-stage Calibration-Reasoning framework that effectively enhances the interpretability and accuracy of speech quality assessment. The Calibration stage aligns the model to predict perceptual dimensions, while the Reasoning stage employs Group Relative Policy Optimization (GRPO) with dimension-specific rewards to improve the model's diagnostic capabilities. This dual approach is innovative, as it addresses the limitations of existing models that either lack interpretability or fail to provide detailed artifact classification and localization.
The authors conduct extensive experiments using the QualiSpeech benchmark, achieving a state-of-the-art mean Pearson Correlation Coefficient (PCC) of 0.71 and a 13% improvement in MOS prediction. The results are rigorously presented, with a clear comparison against existing baselines, demonstrating the effectiveness of the proposed methods in both numerical accuracy and descriptive capabilities. The ablation studies further validate the necessity of both stages in the framework.
The paper provides links to the model weights and demo pages, which enhances reproducibility. However, the implementation details could be more explicit, particularly regarding the training configurations and hyperparameters used for the models.
The paper acknowledges limitations, such as the computational overhead introduced by unfreezing the audio encoder and the dependency on the predefined artifact taxonomy of the QualiSpeech benchmark. This could hinder the model's performance on novel audio artifacts not represented in the training data.
The proposed framework has significant implications for the field of speech quality assessment, particularly in applications requiring high interpretability and precision, such as telecommunication and audio streaming services. The ability to classify and localize audio artifacts can lead to improved user experiences and more effective quality control measures in audio processing systems.
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced to enable large-scale emotion-aware text modeling without requiring manually annotated text corpora. The second part addresses emotion recognition in conversational settings. Hierarchical architectures combining cross-modal attention and mixture-of-experts fusion are developed to integrate acoustic and semantic information across conversational turns. Finally, the thesis introduces a textless and non-parallel speech-to-speech framework for emotion style transfer that enables controllable emotional transformations while preserving speaker identity and linguistic content. The results demonstrate improved emotion transfer and show that style-transferred speech can be used for data augmentation to improve emotion recognition.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science
The main contribution of this paper is the development of a multimodal framework for emotion recognition and synthesis in spoken language, which integrates acoustic and semantic information to enhance the understanding and generation of human emotions. This work is significant as it addresses a critical challenge in AI and human-computer interaction, paving the way for more emotionally aware systems.
The paper presents a comprehensive methodology that integrates acoustic and semantic modeling for emotion recognition in spoken language. The approach is innovative in its use of pre-training strategies that combine acoustic and semantic supervision, which is a notable advancement in the field. The hierarchical architectures developed for emotion recognition in conversational settings, particularly the cross-modal attention and mixture-of-experts fusion, demonstrate a sophisticated understanding of multimodal interactions. The introduction of a textless speech-to-speech framework for emotion style transfer is particularly novel, as it addresses the challenge of preserving speaker identity while allowing for emotional transformations.
The experiments are well-structured, utilizing established datasets such as IEMOCAP and MELD, which are appropriate for the tasks at hand. The results indicate significant improvements in emotion recognition and style transfer capabilities, showcasing the effectiveness of the proposed methods. However, the paper could benefit from a more detailed analysis of the performance metrics used, as well as comparisons with existing state-of-the-art systems to contextualize the improvements.
The paper includes some implementation details, such as dropout rates and training epochs, but lacks comprehensive information on the experimental setup, code availability, and data preprocessing steps. This omission may hinder reproducibility, as other researchers may struggle to replicate the results without access to the full methodology and code.
One limitation is the reliance on existing datasets, which may not fully capture the diversity of emotional expressions in real-world scenarios. Additionally, the paper does not address potential biases in the datasets used, which could affect the generalizability of the models. The emotional categories considered may also be limited, potentially excluding nuanced emotional states.
The findings of this research have significant implications for the development of emotionally intelligent AI systems, which can enhance human-computer interaction in various applications, including virtual assistants, customer service bots, and therapeutic tools. By improving emotion recognition and synthesis, the work contributes to creating more engaging and empathetic AI systems that can better understand and respond to human emotions.
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio understanding beyond just what words are said: how they are said, and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
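The kind of synthetic construction the abstract mentions (overlaying two natural audio samples) is typically done by scaling one signal to hit a target SNR before mixing; a minimal sketch is below. SCENEBench's exact construction details may differ.

```python
import numpy as np

def mix_at_snr(speech, background, snr_db):
    """Hedged sketch of overlaying two audio samples at a target SNR:
    scale the background so that 10*log10(P_speech / P_background) = snr_db."""
    background = background[: len(speech)]
    p_speech = np.mean(speech ** 2) + 1e-12
    p_bg = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_bg * 10 ** (snr_db / 10)))
    return speech + scale * background

# Toy usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
mix = mix_at_snr(rng.standard_normal(16000), rng.standard_normal(16000), snr_db=5)
```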
Primary: Stanford University
All Institutions: Stanford University
This paper presents a novel benchmark suite for audio understanding, SCENEBench, which targets critical yet understudied areas in audio processing. The methodology and experimental results provide valuable insights into the capabilities and limitations of current LALMs, paving the way for future research and development in the field.
The methodology presented in this paper is robust, focusing on the creation of SCENEBench, which is a comprehensive benchmark suite for evaluating audio understanding across diverse tasks. The authors have carefully chosen categories that address real-world needs, particularly in assistive technology and industrial applications. The synthetic construction of audio samples, while innovative, raises questions about the ecological validity of the results. The validation against natural audio samples is a positive aspect, but further details on the synthesis process and selection criteria could enhance understanding.
The experimental evaluation is thorough, assessing five state-of-the-art LALMs across the proposed tasks. The results reveal significant performance gaps, indicating that some models struggle with specific tasks, which is critical for guiding future research. However, the paper could benefit from a more detailed analysis of the performance metrics used and how they correlate with real-world applicability. The inclusion of model latency as a performance metric is a valuable addition, emphasizing practical considerations in audio processing.
The paper does not provide sufficient details regarding the implementation of the benchmark or the models used in the experiments. This lack of transparency may hinder reproducibility. Including more information on the datasets, model configurations, and evaluation protocols would be beneficial for other researchers looking to replicate or build upon this work.
One limitation is the reliance on synthetic audio samples, which may not fully capture the complexities of real-world audio environments. Additionally, the paper acknowledges that some tasks performed below random chance, indicating potential issues with either the benchmark design or the models tested. The authors could further explore the reasons behind these performance discrepancies.
The proposed benchmark has the potential to significantly impact both academic research and practical applications in audio understanding. By addressing gaps in current evaluation methods, SCENEBench could facilitate advancements in assistive technologies and industrial noise monitoring, ultimately improving accessibility and safety in various environments.
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimization, inherently limiting performance. We present an end-to-end multi-channel KWS framework that exploits spatial cues to improve noise robustness. A spatial encoder learns inter-channel features, while a spatial embedding injects directional priors; the fused representation is processed by a streaming backbone. Experiments in simulated noisy conditions across multiple signal-to-noise ratios (SNRs) show that spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving the best results. These findings validate end-to-end multi-channel spatial modeling, indicating strong potential for target-speaker-aware detection in complex acoustic scenarios.
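One plausible reading of "a spatial embedding injects directional priors" is an embedding of the (assumed-known) target direction of arrival added to the spatial encoder's features; a minimal sketch is below. The number of DOA bins, feature dimension, and additive fusion are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class DirectionAwareFusion(nn.Module):
    """Hedged sketch: fuse inter-channel features from a spatial encoder
    with an embedding of a quantized target direction of arrival."""
    def __init__(self, feat_dim=128, num_doa_bins=36):
        super().__init__()
        self.doa_embed = nn.Embedding(num_doa_bins, feat_dim)

    def forward(self, spatial_feats, doa_bin):
        # spatial_feats: (batch, time, feat_dim); doa_bin: (batch,) integer bins
        prior = self.doa_embed(doa_bin).unsqueeze(1)   # (batch, 1, feat_dim)
        return spatial_feats + prior                   # fused representation

fusion = DirectionAwareFusion()
out = fusion(torch.randn(2, 50, 128), torch.tensor([3, 17]))
```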
Primary: Midea Group (Shanghai) Co
All Institutions: Midea Group (Shanghai) Co
The main contribution of this paper is the introduction of an end-to-end direction-aware KWS framework that effectively utilizes spatial cues to improve keyword detection in noisy environments. This work represents a meaningful advancement in the field of audio processing, particularly for applications requiring robust performance in complex acoustic scenarios.
The paper presents a novel end-to-end multi-channel keyword spotting (KWS) framework that integrates a spatial encoder and a direction-aware embedding. This approach is significant as it allows for joint optimization of feature extraction and detection, which is a departure from traditional cascaded systems. The methodology is well-structured, detailing the components of the framework, including the spatial encoder, spatial embedding, and the streaming KWS model. The use of multi-channel signals and the explicit incorporation of spatial cues are innovative aspects that enhance noise robustness. However, the assumption of known direction-of-arrival (DOA) during training and evaluation may limit practical applicability.
The experimental setup is thorough, utilizing the Google Speech Commands dataset and simulating various noisy environments to evaluate the proposed framework's performance. The results demonstrate clear advantages over traditional single-channel and cascaded systems, particularly in challenging acoustic conditions. The paper provides a comprehensive analysis of performance metrics across different signal-to-noise ratios (SNRs), showcasing the effectiveness of the spatial modeling and directional priors. However, the results could benefit from comparisons with more contemporary KWS systems to better contextualize the advancements.
The paper lacks detailed implementation specifics, such as hyperparameter settings and training protocols, which are crucial for reproducibility. While the methodology is described, the absence of a publicly available code repository or demo limits the ability for other researchers to replicate the findings. Future work should consider releasing the model and code to enhance reproducibility.
One limitation is the reliance on known DOA during training and evaluation, which may not be feasible in real-world applications. Additionally, the performance gains from spatial priors are modest, particularly in scenarios without strong directional interference, suggesting that the model may not always leverage the spatial information effectively. The paper also does not address potential computational overhead introduced by the multi-channel processing.
The proposed KWS framework has significant implications for voice-controlled applications, particularly in noisy environments where traditional systems struggle. By improving noise robustness and leveraging spatial information, this research can enhance user experience in various applications, including smart home devices and personal assistants. The modular nature of the framework also opens avenues for future research, such as integrating dynamic DOA estimation and enhancing the model's adaptability to diverse acoustic conditions.
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
Primary: Universitat Politècnica de Catalunya (UPC)
All Institutions: Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC)
The main contribution of this paper is the introduction of a zero-AV-resource AVSR framework that utilizes synthetic visual data to enhance speech recognition capabilities in under-resourced languages. This innovative approach not only addresses a critical gap in the field but also opens avenues for future research and development in multimodal speech recognition.
The proposed methodology leverages synthetic visual data generated from static images to create a training framework for AVSR in zero-resource scenarios. The use of lip-syncing techniques to generate talking-head videos is innovative, particularly in the context of under-resourced languages like Catalan. The end-to-end pipeline for generating synthetic audiovisual data is well-structured and language-agnostic, which enhances the applicability of the approach. The integration of a semi-automatic annotation pipeline further strengthens the methodology by providing a means to evaluate the model effectively. However, the reliance on synthetic data may raise questions about the generalizability of the results to real-world applications.
The experiments conducted are thorough, comparing the proposed model against both audio-only baselines and state-of-the-art ASR systems. The results demonstrate significant improvements in transcription accuracy when using synthetic visual data, particularly in challenging noise conditions. The authors provide clear metrics (WER) to quantify performance, and the comparative analysis with existing models like Whisper adds depth to the evaluation. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of various components of the model.
The paper includes a link to the GitHub repository containing the code and resources for synthetic data generation and annotation, which is a positive aspect for reproducibility. However, the details regarding the datasets and specific configurations used in the experiments could be more explicitly stated to facilitate replication by other researchers.
One limitation is the potential gap between synthetic and real-world data, as the synthetic videos may not fully capture the complexities of natural speech and visual cues. Additionally, while the model shows promise for Catalan, its performance on other under-resourced languages remains untested. The reliance on a single method for generating synthetic videos may also limit the robustness of the approach.
This research has the potential to significantly impact the field of speech recognition, particularly for under-resourced languages, by providing a scalable method for training AVSR systems without the need for extensive audiovisual datasets. The implications extend to various applications in accessibility, communication technologies, and language preservation.
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
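The interleaved padding strategy is described only briefly; one way to realize it is to insert pad slots between consecutive hypothesis tokens so a bidirectional editor can copy most positions unchanged and fill or rewrite only where needed. The sketch below follows that reading; the exact layout and number of pad slots in NLE are assumptions.

```python
def interleave_padding(hyp_tokens, pad_id, pads_per_gap=1):
    """Hedged sketch of interleaved padding: place pad slots after each
    hypothesis token so the editor can exploit the identity-mapping bias
    (copy correct tokens) while using pad positions for insertions."""
    out = []
    for tok in hyp_tokens:
        out.append(tok)
        out.extend([pad_id] * pads_per_gap)
    return out

# A 3-token CTC hypothesis with one insertion slot after each token:
print(interleave_padding([17, 289, 904], pad_id=0))  # [17, 0, 289, 0, 904, 0]
```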
Primary: IBM Research
All Institutions: IBM Research
The main contribution of this paper is the introduction of a non-autoregressive LLM-based ASR system that effectively combines the strengths of pretrained speech encoders and language models through a novel editing approach, significantly improving transcription speed and maintaining competitive accuracy. The methodology is innovative, and the experimental results demonstrate substantial technical impact, making it a valuable contribution to the field of machine learning and speech recognition.
The proposed methodology introduces a non-autoregressive (NAR) approach to automatic speech recognition (ASR) by framing it as conditional transcript editing. This is achieved through a bidirectional LLM editor that refines an initial hypothesis generated by a pretrained speech encoder. The interleaved padding strategy is a notable innovation, allowing the model to focus on corrections rather than full reconstructions, which enhances the efficiency of the editing process. The use of lightweight LoRA adapters for model adaptation is also a significant methodological contribution, enabling the model to leverage pretrained linguistic knowledge effectively while maintaining a manageable number of trainable parameters.
The experiments conducted are rigorous, with the authors evaluating their model against leading ASR systems on the Open ASR leaderboard. The reported results demonstrate a competitive word error rate (WER) of 5.67% for NLE++, with a substantial speedup of 27x over autoregressive baselines in single-utterance scenarios. The inclusion of ablation studies further strengthens the evaluation, providing insights into the impact of various design choices on performance. However, the paper could benefit from more extensive comparisons with a broader range of models and additional datasets to validate the robustness of the findings.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository or demo URL limits the ability for others to directly replicate the results. The authors mention using specific datasets and configurations, which is helpful, but sharing the implementation would significantly improve reproducibility.
The paper acknowledges that the NLE approach is less flexible than autoregressive models in scenarios requiring substantial changes to the hypothesis. It also highlights potential latency overhead due to the need for retokenization when using different tokenizers for the CTC encoder and the LLM. Moreover, the performance in multilingual settings appears to be weaker, suggesting that the model's training data may not be adequately representative of all languages.
The proposed NLE system has significant implications for real-time ASR applications, particularly in conversational settings where low latency is critical. By enabling faster and more accurate transcription, this approach could enhance user experiences in various domains, including virtual assistants, customer service, and accessibility technologies. The ability to refine initial hypotheses rather than regenerate them from scratch could also lead to more efficient use of computational resources.
Automatic speech intelligibility assessment is crucial for monitoring speech disorders and therapy efficacy. However, existing methods are difficult to compare: research is fragmented across private datasets with inconsistent protocols. We introduce PathBench, a unified benchmark for pathological speech assessment using public datasets. We compare reference-free, reference-text, and reference-audio methods across three protocols (Matched Content, Extended, and Full) representing how a linguist (controlled stimuli) versus machine learning specialist (maximum data) would approach the same data. We establish benchmark baselines across six datasets, enabling systematic evaluation of future methodological advances, and introduce Dual-ASR Articulatory Precision (DArtP), achieving the highest average correlation among reference-free methods.
Primary: University of Cologne
All Institutions: University of Cologne, Nagoya University, University of Groningen
The paper presents PathBench, a comprehensive benchmarking framework for assessing speech intelligibility in pathological speech, addressing critical gaps in the field. The innovative methodology and rigorous experimental evaluation contribute significantly to advancing the state of research in automatic speech assessment, with potential applications in clinical settings.
The paper introduces PathBench, a systematic benchmarking framework for assessing speech intelligibility in pathological speech, which is a significant advancement given the fragmented nature of existing research. The methodology is robust, employing a variety of protocols that cater to both linguistic and machine learning perspectives. The introduction of Dual-ASR Articulatory Precision (DArtP) as a reference-free method is particularly innovative, providing a new way to evaluate articulatory precision without the need for labeled training data. The authors also address confounding factors such as speaker age and recording noise, which enhances the credibility of their findings.
The experiments are comprehensive, utilizing six datasets and establishing baseline performances across multiple protocols. The results demonstrate that DArtP achieves the highest correlation among reference-free methods, which is a notable contribution. The statistical analyses, including Wilcoxon Signed-Rank Tests, are well-executed, providing strong evidence for the superiority of certain methodologies over others. The detailed reporting of results across various conditions adds to the rigor of the evaluation.
The authors provide a GitHub repository with code and resources, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the datasets and specific implementation details to facilitate easier replication of the results by other researchers.
The study is limited to four languages (English, Italian, Spanish, and Dutch), which may restrict its applicability to a broader audience. Additionally, while the authors address confounding factors, the impact of noise in real-world scenarios remains untested, which is critical for clinical applications. The reliance on public datasets may also introduce variability that could affect the generalizability of the findings.
The implications of this research are significant for the fields of speech therapy and clinical assessment of speech disorders. By providing a standardized benchmarking framework, PathBench can facilitate future research and development of more effective speech intelligibility assessment tools. This could ultimately improve patient outcomes in clinical settings by enabling better monitoring and evaluation of speech disorders.
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).
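The EER numbers quoted here come from an automatic speaker verification attack; the metric itself is the operating point where false-acceptance and false-rejection rates coincide. A minimal sketch of that computation is below, over hypothetical similarity scores; the VoicePrivacy protocol specifies the actual attacker and scoring.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Hedged sketch of EER: sweep a threshold over ASV similarity scores
    and report the point where false-acceptance and false-rejection rates
    are (approximately) equal."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy usage with synthetic score distributions (not real ASV outputs).
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500)))
```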
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Huawei Leibniz Research Center, Nanyang Technological University, The Hong Kong Polytechnic University
The paper effectively characterizes speaker identity leakage in full-duplex speech dialogue models and proposes innovative anonymization techniques that significantly enhance privacy without sacrificing usability. This work is a crucial step towards ensuring the responsible deployment of AI-driven speech technologies.
The paper introduces a novel approach to analyzing speaker identity leakage in end-to-end full-duplex speech dialogue models, specifically SALM-Duplex and Moshi. The authors employ a lazy-informed attacker scenario to assess privacy risks, which is a relevant and timely concern given the increasing use of always-on speech systems. The proposed anonymization techniques, Anon-W2W and Anon-W2F, are well-structured, with clear distinctions between waveform-level and feature-domain methods. The methodology is rigorous, utilizing established metrics like Equal Error Rate (EER) and Linkability to quantify privacy improvements.
The experiments are comprehensive, employing a standardized dataset from the VoicePrivacy 2024 Challenge and a well-defined evaluation protocol. The results demonstrate significant improvements in privacy metrics, particularly with the Anon-W2F method, which achieves a notable increase in EER, indicating strong privacy protection. The authors also provide a thorough analysis of the impact of anonymization on dialogue quality and efficiency, showcasing a balanced consideration of privacy and usability.
The paper includes sufficient details regarding the experimental setup, including model architectures, training datasets, and evaluation metrics, which should facilitate reproducibility. However, the reliance on specific datasets and the proprietary nature of some components may pose challenges for full replication.
The study primarily focuses on two specific models (SALM-Duplex and Moshi), which may limit the generalizability of the findings to other full-duplex systems. Additionally, while the proposed anonymization methods show promise, the impact on speech quality and naturalness remains an area for further exploration. The authors also acknowledge that their quality metrics may not fully capture speech-level attributes.
The implications of this research are significant, particularly in the context of privacy regulations like GDPR. By addressing the privacy risks associated with always-on speech systems, the work contributes to the development of safer AI technologies that can be deployed in real-world applications without compromising user privacy. The findings could influence future designs of speech dialogue systems, emphasizing the need for privacy-by-design principles.
Although deep neural networks have facilitated significant progress of neural vocoders in recent years, they usually suffer from intrinsic challenges like opaque modeling, inflexible retraining under different input configurations, and the parameter-performance trade-off. These inherent hurdles can heavily impede the development of this field. To resolve these problems, in this paper, we propose a novel neural vocoder in the time-frequency (T-F) domain. Specifically, we bridge the connection between classical range-null decomposition (RND) theory and the vocoder task, where the reconstruction of the target spectrogram is formulated as the superimposition of range-space and null-space components. The former projects the representation in the original mel domain into the target linear-scale domain, and the latter can be instantiated via neural networks to further infill the spectral details. To fully leverage the spectrum prior, an elaborate dual-path framework is devised, where the spectrum is hierarchically encoded and decoded, and cross- and narrow-band modules are leveraged for effective modeling along the sub-band and time dimensions. To enable inference under various configurations, we propose a simple yet effective strategy, which transforms multi-condition adaptation at the inference stage into data augmentation at the training stage. Comprehensive experiments are conducted on various benchmarks. Quantitative and qualitative results show that, while enjoying a lightweight network structure and a scalable inference paradigm, the proposed framework achieves state-of-the-art performance among existing advanced methods. Code is available at https://github.com/Andong-Li-speech/RNDVoC.
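The range-null decomposition invoked here is classical: writing the mel analysis as a linear map $y = Ax$ from the linear-scale spectrogram $x$ to the mel spectrogram $y$, any reconstruction consistent with the observation splits into a range-space part fixed by $y$ and a null-space part carrying the missing detail, which the paper instantiates with a neural network (written $f_\theta$ below as our own notation):
\[
\hat{x} \;=\; \underbrace{A^{+} y}_{\text{range space}} \;+\; \underbrace{\bigl(I - A^{+}A\bigr)\, f_\theta(y)}_{\text{null space}},
\qquad A\hat{x} \;=\; AA^{+}y \;=\; y \quad \text{whenever } y = Ax,
\]
so the mel observation is reproduced exactly regardless of the network's output, and the network only fills in spectral detail that the mel projection cannot determine.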
Primary: Institute of Acoustics, Chinese Academy of Sciences
All Institutions: Institute of Acoustics, Chinese Academy of Sciences, Chongqing University of Posts and Telecommunications, Tencent AI Lab, University of Chinese Academy of Sciences
This paper makes a significant contribution to the field of neural vocoding by introducing a novel architecture that effectively utilizes range-null space decomposition, enhancing both the interpretability and performance of audio synthesis models. The methodology is well-structured, and the experimental results substantiate its effectiveness, positioning it as a valuable advancement in the audio processing domain.
The paper introduces a novel neural vocoder architecture based on range-null space decomposition (RND), which effectively separates the reconstruction of audio spectrograms into two orthogonal components: range-space and null-space. This approach is innovative as it leverages classical signal processing theory to enhance the interpretability and robustness of neural vocoders. The dual-path framework proposed allows for hierarchical encoding and decoding of spectral features, which is a significant advancement over existing methods that typically use full-band modules. The introduction of a multi-condition-as-data-augmentation strategy is also noteworthy, as it allows for scalable inference without the need for retraining, addressing a common limitation in neural vocoders.
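To make the decomposition concrete, the sketch below (a minimal illustration using a toy mel filterbank and random stand-ins for the network output, not the authors' implementation) shows how a linear-scale spectrogram estimate splits into a range-space term fixed by the mel observation and a null-space term that a network is free to fill in:

    import numpy as np

    # Toy dimensions: 80 mel bands observed, 513 linear-frequency bins to recover.
    n_mel, n_fft_bins, n_frames = 80, 513, 10

    rng = np.random.default_rng(0)
    A = np.abs(rng.standard_normal((n_mel, n_fft_bins)))           # stand-in mel filterbank
    x_true = np.abs(rng.standard_normal((n_fft_bins, n_frames)))   # "true" linear spectrogram
    y = A @ x_true                                                  # observed mel spectrogram

    A_pinv = np.linalg.pinv(A)                                      # Moore-Penrose pseudo-inverse

    # Range-space term: fully determined by the mel observation.
    x_range = A_pinv @ y

    # Null-space term: any candidate (random here, a network output in practice)
    # projected so that it cannot change the mel observation.
    x_candidate = np.abs(rng.standard_normal((n_fft_bins, n_frames)))
    x_null = (np.eye(n_fft_bins) - A_pinv @ A) @ x_candidate

    x_hat = x_range + x_null

    # The reconstruction stays consistent with the mel observation regardless of x_candidate.
    print(np.allclose(A @ x_hat, y))

Whatever the network places in the null-space component, the mel-consistency constraint is preserved by construction, which is the interpretability argument the review highlights.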
The authors conducted comprehensive experiments on established benchmarks, including LJSpeech and LibriTTS, demonstrating state-of-the-art performance compared to existing methods. The quantitative metrics and qualitative assessments indicate that the proposed method not only achieves high-quality audio synthesis but also maintains a lightweight network structure, enhancing its practical applicability. The ablation studies further validate the effectiveness of the proposed components, providing a thorough evaluation of their contributions to performance.
The paper provides a GitHub repository link for code access, which is crucial for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and training configurations, could be better documented to facilitate easier replication of results by other researchers.
While the proposed method shows promise, it may still face challenges in handling extreme variations in input conditions that were not covered in the training data. Additionally, the reliance on the pseudo-inverse operation might introduce computational overhead in real-time applications, which could limit its deployment in resource-constrained environments.
The advancements in neural vocoding presented in this paper have significant implications for various audio processing applications, including text-to-speech synthesis, music generation, and speech enhancement. By improving the quality and efficiency of vocoders, this work could enhance user experiences in voice interfaces and multimedia applications, contributing to the broader field of artificial intelligence in audio processing.
The paper presents a significant advancement in neural vocoding by introducing a scalable framework that effectively integrates range-null space decomposition, addressing key challenges in the field. The innovative methodology and comprehensive experimental validation position this work as a valuable contribution to the audio processing community.
The proposed methodology introduces a novel neural vocoder framework based on range-null space decomposition (RND), which effectively addresses common challenges in existing vocoders, such as opaque modeling and inflexible retraining. The dual-path framework allows for hierarchical encoding and decoding of spectral features, leveraging both range-space and null-space modeling. The introduction of a multiple-condition-as-data-augmentation (MCDA) strategy enhances the model's adaptability to various mel configurations without the need for retraining, showcasing an innovative approach to scalability in neural vocoders.
The experiments are comprehensive, utilizing well-known benchmarks like LJSpeech and LibriTTS. The results demonstrate that the proposed method achieves state-of-the-art performance, outperforming existing models such as BigVGAN with significantly fewer parameters. The quantitative metrics, including PESQ and MCD, alongside qualitative assessments, indicate a robust evaluation of the model's effectiveness.
The paper provides a GitHub repository for code access, which is crucial for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and training procedures, should be clearly documented to facilitate replication by other researchers.
While the proposed framework shows promise, it may still struggle with certain edge cases in phase recovery and may require further optimization for real-time applications. Additionally, the reliance on specific datasets may limit the generalizability of the findings.
The advancements in neural vocoding have significant implications for various applications in speech synthesis, music generation, and audio processing. The ability to efficiently adapt to different configurations can enhance the deployment of these models in real-world scenarios, potentially leading to broader adoption in commercial products.
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0$\times$ latency reduction with a cache of only ${\sim}$1K entries while preserving or improving perceptual quality.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
SoundWeaver introduces a novel approach to accelerating text-to-audio diffusion models through semantic warm-starting, demonstrating substantial improvements in latency and quality. The comprehensive methodology and experimental validation position this work as a meaningful contribution to the field of machine learning in audio generation.
The methodology presented in SoundWeaver is innovative, focusing on warm-starting text-to-audio diffusion models by leveraging semantically similar cached audio. The system comprises three main components: a Reference Selector for retrieving and aligning cached audio, a Skip Gater for determining the number of NFEs to skip, and a Cache Manager for maintaining cache quality. The use of a contextual multi-arm bandit approach for the Skip Gater is particularly noteworthy, as it adapts to varying user prompts and optimizes performance dynamically. The integration of semantic and duration-aware retrieval mechanisms adds depth to the approach, allowing for more efficient audio generation while preserving quality.
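A compact sketch of the warm-starting control flow appears below; the similarity threshold, fixed skip ratio, re-noising rule, and the denoise_step stub are placeholder assumptions standing in for the Reference Selector, Skip Gater, and the underlying diffusion model, not SoundWeaver's actual components:

    import numpy as np

    rng = np.random.default_rng(0)

    def denoise_step(x, t):
        """Placeholder for one reverse-diffusion update of a text-to-audio model."""
        return x - 0.01 * x * (t / 50.0)

    def cosine_sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def generate(prompt_emb, cache, total_steps=50, sim_threshold=0.8, skip_ratio=0.5):
        """Warm-start sketch: reuse a cached latent for a similar prompt and skip early steps."""
        # Reference selection: nearest cached entry by prompt-embedding similarity.
        best = max(cache, key=lambda e: cosine_sim(prompt_emb, e["prompt_emb"]), default=None)

        if best is not None and cosine_sim(prompt_emb, best["prompt_emb"]) >= sim_threshold:
            # Skip gating: drop a fraction of the early (coarse) steps, then re-noise the
            # cached latent to roughly the matching intermediate noise level (simplified mix).
            start_step = int(total_steps * skip_ratio)
            noise = rng.standard_normal(best["latent"].shape)
            x = np.sqrt(1 - skip_ratio) * best["latent"] + np.sqrt(skip_ratio) * noise
        else:
            start_step, x = 0, rng.standard_normal(256)  # cold start from pure noise

        for t in range(start_step, total_steps):
            x = denoise_step(x, t)
        return x, total_steps - start_step               # latent and NFEs actually spent

    cache = [{"prompt_emb": rng.standard_normal(8), "latent": rng.standard_normal(256)}]
    latent, nfes = generate(rng.standard_normal(8), cache)
    print("NFEs used:", nfes)

The latency saving comes entirely from the reduced number of denoising calls on warm-started requests, which is why cache hit quality and the skip decision dominate the system's behavior.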
The experimental evaluation is robust, utilizing real-world audio traces and a variety of metrics to assess performance. The results demonstrate significant latency reductions (1.8-3.0x) while maintaining or improving perceptual quality across different models. The ablation studies effectively illustrate the contributions of each component, reinforcing the importance of the proposed methods. However, the reliance on specific datasets and the absence of extensive user studies could limit the generalizability of the findings.
The paper provides a detailed description of the experimental setup, including the models used, metrics evaluated, and the caching mechanism. However, the lack of a publicly accessible code repository or demo limits reproducibility. The authors mention using generative AI for writing and evaluation, which raises questions about the transparency of the evaluation process.
The paper acknowledges limitations such as potential phase vocoder distortion on longer audio requests and the lack of dedicated request schedulers. Additionally, the system's performance with complex samplers remains untested, which could impact its applicability in diverse scenarios.
SoundWeaver has significant implications for real-time audio generation applications, such as music composition and sound design. By reducing latency and improving throughput, it can enhance user experience in various audio-related services. The model-agnostic nature of the approach also suggests potential for broader adoption across different diffusion models and applications.
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University
The main contribution of this paper is the introduction of Universal Speech Content Factorization (USCF), a novel method that enables speaker-agnostic content extraction from speech, significantly enhancing the capabilities of voice conversion systems while maintaining competitive performance with minimal target data. This work advances the field of speech processing by providing a practical solution for open-set voice conversion and potential applications in text-to-speech synthesis.
The proposed Universal Speech Content Factorization (USCF) method builds on the existing Speech Content Factorization (SCF) framework but extends it to an open-set scenario, allowing for speaker-agnostic content extraction. The methodology employs a linear transformation approach through least-squares optimization, which is a straightforward yet effective technique for deriving speaker-specific transformations from minimal target speech data. The authors provide a clear mathematical formulation and rationale for their approach, demonstrating how the linear structure of SCF can be generalized. The use of embedding analysis to validate the effectiveness of USCF in removing speaker-dependent variations while preserving phonetic content is a strong methodological aspect. However, the reliance on linear assumptions may limit the generalizability of the method to more complex speech patterns.
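A minimal numerical sketch of the linear pipeline described above follows; the feature dimensions, random stand-in data, and the direction of the speaker-specific back-projection are illustrative assumptions rather than the authors' implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins: self-supervised speech features (d-dim) and content targets (k-dim, k < d).
    d, k, n_train_frames = 768, 128, 5000
    X_multi = rng.standard_normal((n_train_frames, d))   # frames pooled from many speakers
    C_multi = rng.standard_normal((n_train_frames, k))   # corresponding content targets

    # Universal speech-to-content mapping: W minimizes ||X W - C||_F^2 over many speakers.
    W, *_ = np.linalg.lstsq(X_multi, C_multi, rcond=None)

    # Speaker-specific transformation from a few seconds of target speech: map the target
    # speaker's frames into the shared content space, then solve the small reverse
    # least-squares problem to re-inject that speaker's timbre.
    X_target = rng.standard_normal((300, d))             # ~a few seconds of target frames
    C_target = X_target @ W                              # timbre-suppressed content
    B, *_ = np.linalg.lstsq(C_target, X_target, rcond=None)

    # Zero-shot conversion sketch: source frames -> content -> target-speaker features.
    X_source = rng.standard_normal((200, d))
    X_converted = (X_source @ W) @ B
    print(X_converted.shape)                             # (200, 768)

Because both directions are plain least-squares fits, the whole conversion path stays linear and invertible, which is what makes the embedding analysis of speaker removal straightforward to run.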
The experimental setup is robust, utilizing diverse datasets (LibriSpeech and TIMIT) and comparing USCF against several baseline methods, including kNN-VC and LinearVC. The evaluation metrics are comprehensive, including both objective measures (ASR WER, UTMOS) and subjective evaluations (MOS, SMOS), which provide a well-rounded assessment of the voice conversion quality. The results indicate that USCF performs competitively, particularly in content preservation, although it shows some degradation in speaker similarity compared to other methods. The paper also includes ablation studies that offer insights into the influence of various parameters on performance.
The paper mentions that speech samples and code are publicly available, which is crucial for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and the exact configurations used for experiments, are not fully elaborated, which may pose challenges for others attempting to replicate the results. The inclusion of a GitHub repository is a positive aspect, but further documentation would enhance reproducibility.
One limitation of the USCF approach is its dependence on linear transformations, which may not capture the full complexity of speaker variations in speech. Additionally, while the method shows promise in zero-shot scenarios, the performance does degrade in terms of speaker similarity, indicating potential areas for improvement. The requirement for a minimum amount of target speaker data (10 seconds) could also limit its applicability in scenarios where only very limited data is available.
The implications of this research are significant for applications in voice conversion and text-to-speech systems, particularly in scenarios requiring speaker adaptation with minimal data. The ability to effectively disentangle speaker timbre from phonetic content could enhance personalized voice synthesis technologies, improve accessibility features, and support various applications in entertainment and communication. The method's efficiency and effectiveness could lead to broader adoption in real-world systems, particularly in environments where diverse speaker profiles are encountered.
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
Primary: University of Sheffield
All Institutions: University of Sheffield, University of Southern California
The main contribution of this paper is the introduction of VoxEmo, a comprehensive benchmarking framework for speech emotion recognition that addresses the challenges of prompt sensitivity and human emotion ambiguity in the evaluation of speech LLMs. The technical contributions, including the standardized toolkit and innovative evaluation strategies, position this work as a significant advancement in the field of SER.
The paper introduces VoxEmo, a novel benchmarking framework for speech emotion recognition (SER) using speech LLMs. The methodology is well-structured, addressing the challenges of prompt sensitivity and human emotion ambiguity through a comprehensive toolkit that includes a distribution-aware soft-label protocol and a prompt-ensemble strategy. The approach of utilizing multiple prompts to capture the stochastic nature of LLM outputs is innovative, although it may lead to increased complexity in evaluation.
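As an illustration of the soft-label and prompt-ensemble ideas, the sketch below aggregates free-text outputs from several prompt variants into a predicted emotion distribution and scores it against an annotator distribution with Jensen-Shannon divergence; the keyword parser, emotion set, and divergence choice are assumptions, not necessarily VoxEmo's exact protocol:

    from collections import Counter
    import math

    EMOTIONS = ["angry", "happy", "neutral", "sad"]

    def predict_distribution(outputs):
        """Turn free-text LLM outputs from several prompts into a label distribution."""
        counts = Counter()
        for text in outputs:
            for emo in EMOTIONS:
                if emo in text.lower():
                    counts[emo] += 1
                    break
        total = sum(counts.values()) or 1
        return [counts[e] / total for e in EMOTIONS]

    def js_divergence(p, q, eps=1e-12):
        """Jensen-Shannon divergence between two label distributions."""
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        def kl(a, b):
            return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Hypothetical outputs for one utterance under four prompt variants, and a soft
    # label reflecting annotator disagreement (e.g., 3 of 5 raters chose "sad").
    model_outputs = ["The speaker sounds sad.", "sad", "Possibly neutral.", "sad, low energy"]
    annotator_dist = [0.0, 0.0, 0.4, 0.6]

    pred_dist = predict_distribution(model_outputs)
    print(pred_dist, js_divergence(pred_dist, annotator_dist))

Scoring distributions rather than single labels is what lets the benchmark reward models that mirror human disagreement even when their hard-label accuracy is lower.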
The experiments are extensive, covering 35 emotion corpora across 15 languages. The results demonstrate the performance of two speech LLMs (Qwen2-Audio and Audio Flamingo) under various prompt configurations. The analysis of zero-shot performance and the impact of supervised fine-tuning is thorough, providing valuable insights into the strengths and weaknesses of the models. However, the paper could benefit from more detailed comparisons with existing state-of-the-art methods.
The paper emphasizes reproducibility by providing a standardized evaluation toolkit and clear descriptions of the experimental setup, including the selection of models and evaluation metrics. However, the reliance on specific models and the absence of a public code repository may hinder full reproducibility.
The paper acknowledges several limitations, including the focus on only two models with the same audio encoder, the potential for hyperparameter mismatch during fine-tuning, and the restriction of soft-label evaluation to a limited number of datasets. Additionally, the study does not explore within-dataset factors that could affect performance.
The proposed benchmark has significant implications for the development of affect-aware systems in human-computer interaction and speech analytics. By addressing the ambiguity of human emotion and providing a framework for evaluating generative models, this work could lead to advancements in more nuanced and effective emotion recognition systems.
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
WhispEar presents a novel bidirectional framework for whispered speech conversion, effectively addressing data scarcity through innovative pseudo-parallel data generation. The paper's contributions significantly advance the field of speech processing, particularly in enhancing the intelligibility and naturalness of whispered speech.
The methodology presented in WhispEar is innovative, leveraging a bidirectional framework that allows for both whisper-to-normal (W2N) and normal-to-whisper (N2W) conversions. The use of semantic representations to bridge the gap between the two modalities is a significant advancement. The three-stage training process, particularly the zero-shot pseudo-parallel whisper generation, is a clever approach to mitigate the scarcity of parallel data. The incorporation of a lightweight semantic tokenizer and a shared Flow-Matching Transformer model demonstrates a solid understanding of the underlying acoustic characteristics and the need for efficient data utilization.
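As a rough illustration of the augmentation loop, the sketch below treats the N2W and W2N models as opaque placeholders (the function names and byte-string data are purely hypothetical); it only shows how unpaired normal speech is turned into (pseudo-whisper, normal) pairs that scale up W2N training:

    def n2w_convert(normal_wav):
        """Placeholder for the N2W model: normal speech -> pseudo-whispered speech."""
        return b"pseudo-whisper:" + normal_wav

    def train_w2n(pairs):
        """Placeholder for W2N training on (whisper, normal) pairs."""
        print(f"training W2N on {len(pairs)} parallel pairs")

    # Small real parallel corpus plus abundant unpaired normal speech.
    real_pairs = [(b"whisper_0", b"normal_0"), (b"whisper_1", b"normal_1")]
    unpaired_normal = [f"normal_{i}".encode() for i in range(2, 1000)]

    # Zero-shot pseudo-parallel generation: every unpaired normal utterance yields a
    # (pseudo-whisper, normal) pair, enlarging the W2N training set without new recordings.
    pseudo_pairs = [(n2w_convert(wav), wav) for wav in unpaired_normal]

    train_w2n(real_pairs + pseudo_pairs)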
The experiments are well-structured, comparing WhispEar against strong baselines and demonstrating clear performance improvements across various metrics, including intelligibility, naturalness, and prosody recovery. The release of the wEar dataset, the largest bilingual whispered-normal parallel corpus, adds significant value to the research community. The systematic scaling study provides compelling evidence of the effectiveness of the proposed methods, showcasing how increasing the amount of pseudo-parallel data leads to consistent performance gains.
The paper provides sufficient details regarding the training process, data collection, and evaluation metrics, which should enable other researchers to replicate the experiments. However, the absence of a publicly available code repository limits full reproducibility, as potential users cannot directly implement the proposed methods without access to the code.
One limitation noted is the reliance on the quality of the generated pseudo-whispered data, which may not fully capture the nuances of real whispered speech. Additionally, while the framework shows promise, its performance in noisy environments or with diverse speaker characteristics has not been thoroughly evaluated. Future work should address these aspects to enhance robustness and generalizability.
The implications of this research are significant, particularly in areas requiring whispered speech conversion for privacy and communication enhancement. The ability to generate high-quality whispered speech from normal speech could have applications in assistive technologies, voice restoration, and privacy-focused communication tools. The release of the wEar dataset also paves the way for further research in this domain, potentially leading to advancements in speech synthesis and recognition technologies.
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
Primary: Anyreach AI
All Institutions: Anyreach AI
The main contribution of this paper is the introduction of DualTurn, a model that effectively learns turn-taking dynamics in conversational audio through generative pretraining, outperforming existing methods in both anticipation of turn boundaries and prediction of agent actions. This work represents a meaningful advancement in the field of conversational AI, addressing limitations in current models and providing a foundation for future research in multi-speaker interaction systems.
The methodology presented in DualTurn is innovative, leveraging dual-channel generative pretraining to learn turn-taking dynamics without labeled data. The use of a lightweight neural codec for audio encoding, combined with a two-stage training process, allows the model to effectively capture conversational context and predict turn-taking signals. The architecture is well thought out, with a clear distinction between generative pretraining and subsequent fine-tuning for specific tasks, which enhances the model's performance in predicting agent actions.
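One simple way such dual-channel pretraining can be set up is to interleave the two speakers' per-frame codec tokens into a single causal sequence; the sketch below illustrates only that interleaving (the frame rate, codec, and token values are assumptions, not DualTurn's actual tokenization):

    from typing import List

    def interleave_channels(agent_tokens: List[int], user_tokens: List[int],
                            frame_stride: int = 1) -> List[int]:
        """Interleave per-frame codec tokens from both channels into one sequence,
        so a causal LM sees alternating agent/user frames and can model their timing."""
        assert len(agent_tokens) == len(user_tokens)
        merged = []
        for i in range(0, len(agent_tokens), frame_stride):
            merged.extend(agent_tokens[i:i + frame_stride])   # agent channel frame(s)
            merged.extend(user_tokens[i:i + frame_stride])    # user channel frame(s)
        return merged

    # Toy example: 6 frames per channel from a hypothetical neural codec.
    agent = [101, 102, 103, 104, 105, 106]
    user = [201, 202, 203, 204, 205, 206]
    print(interleave_channels(agent, user))
    # [101, 201, 102, 202, 103, 203, 104, 204, 105, 205, 106, 206]

Because next-token prediction over such a merged stream must anticipate what each channel does at the next frame, turn-taking structure is learned implicitly, without any turn labels.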
The experimental evaluation is robust, utilizing standard benchmarks such as Switchboard and otoSpeech to compare DualTurn against existing models like VAP and a large audio-text fusion model. The results demonstrate significant improvements in both word-level turn prediction and agent action prediction, with clear metrics provided (e.g., wF1 and AUC scores). The ablation studies further validate the contributions of different components of the model, showcasing the effectiveness of the generative pretraining stage.
The paper provides sufficient details about the architecture, training procedures, and datasets used, which supports reproducibility. However, the absence of URLs for code or demo implementations limits the ability for others to directly replicate the results. Including a public repository would enhance reproducibility significantly.
One limitation noted is the reliance on a single language (English) and a relatively small dataset (453 hours of dual-channel conversation audio), which may affect the generalizability of the model to other languages or larger, more diverse datasets. Additionally, while the model anticipates turn boundaries earlier, the practical implications of this in real-world applications need further exploration.
The implications of DualTurn are significant for applications in conversational AI, particularly in enhancing the naturalness of interactions in voice assistants and other automated systems. By improving turn-taking dynamics, the model can contribute to more fluid and human-like conversations, which is critical for user satisfaction and engagement in AI-driven communication tools.
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC enables unaltered performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, achieving a 1% relative accuracy degradation on the AST model.
Primary: cortAIx Labs
All Institutions: cortAIx Labs
The paper presents a novel calibration method for low-bit quantization of speech models that leverages evolution strategies to optimize activation scaling, demonstrating significant performance improvements across various tasks. The technical contributions are substantial, addressing a critical gap in the quantization of audio models and paving the way for more efficient deployment in resource-constrained environments.
The proposed Evolution Strategy-Based Calibration (ESC) method is innovative, particularly in its formulation of calibration as a two-step optimization problem that integrates local and global objectives. The use of evolution strategies to optimize activation scaling factors is a novel approach tailored specifically for the audio domain, addressing the unique challenges posed by audio activations that differ significantly from those in vision and NLP. The methodology is well-structured, with clear steps for initialization and optimization, although it could benefit from more detailed explanations of the algorithm's parameters and their tuning.
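For intuition, the sketch below runs a bare-bones (1+λ) evolution strategy over a single activation clipping scale, minimizing fake-quantization error on heavy-tailed calibration data; the objective, log-space search, and hyperparameters are illustrative simplifications rather than the paper's two-step local-global scheme:

    import numpy as np

    rng = np.random.default_rng(0)

    def fake_quantize(x, scale, n_bits=4):
        """Symmetric uniform fake-quantization of activations with a clipping scale."""
        qmax = 2 ** (n_bits - 1) - 1
        q = np.clip(np.round(x / scale * qmax), -qmax, qmax)
        return q / qmax * scale

    # Calibration activations with a heavy-tailed range, as often seen in audio models.
    acts = rng.standard_t(df=3, size=20000).astype(np.float64)

    def objective(scale):
        """Mean squared quantization error for a candidate clipping scale."""
        return float(np.mean((acts - fake_quantize(acts, scale)) ** 2))

    # Simple (1+lambda) evolution strategy over the scale, searched in log-space.
    log_scale, sigma, n_iters, lam = np.log(np.max(np.abs(acts))), 0.5, 30, 16
    for _ in range(n_iters):
        candidates = log_scale + sigma * rng.standard_normal(lam)
        losses = [objective(np.exp(c)) for c in candidates]
        best = candidates[int(np.argmin(losses))]
        if objective(np.exp(best)) < objective(np.exp(log_scale)):
            log_scale = best            # keep the parent unless a child improves the loss
        sigma *= 0.9                    # cool the mutation strength

    print("calibrated scale:", np.exp(log_scale), "MSE:", objective(np.exp(log_scale)))

Starting from the naive min-max scale (the maximum absolute activation) and letting the strategy shrink the clipping range is exactly the kind of search a gradient-free calibrator can afford, since each evaluation is only a forward pass over cached activations.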
The experiments conducted are comprehensive, covering multiple speech tasks and models, which strengthens the validity of the results. The paper reports significant improvements over existing calibration methods, particularly in INT4 quantization, which is crucial for deploying models in resource-constrained environments. However, the paper lacks detailed descriptions of datasets and specific evaluation metrics used, which could enhance the reproducibility and understanding of the results.
While the paper outlines the methodology and experimental setup, it does not provide sufficient implementation details or code availability, which are critical for reproducibility. The absence of a project URL or demo further limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a specific hardware configuration (NVIDIA RTX 3090) for performance evaluation, which may not generalize across different platforms. Additionally, while the method shows promise for INT4 quantization, the paper does not explore the trade-offs or potential degradation in performance for other model architectures or tasks outside those tested.
The proposed ESC method has the potential to significantly impact the deployment of speech models in real-world applications, particularly in scenarios where computational resources are limited. By enabling near-lossless performance at lower bit-widths, this work could facilitate the broader adoption of advanced speech processing technologies in mobile and embedded systems.
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, University of California
The main contribution of this paper is the introduction of Trilobyte, a byte-level tokenization schema that enables tractable modeling of 24-bit audio for lossless compression using autoregressive language models. This work significantly advances the application of machine learning in audio compression, addressing a critical gap in the literature and providing a foundation for future research in the area.
The paper introduces a novel byte-level tokenization schema, Trilobyte, which effectively addresses the vocabulary explosion problem in autoregressive language models (LMs) for lossless audio compression. By reducing the vocabulary size from exponential scaling to a constant size, the authors enable tractable modeling of 24-bit audio, a significant advancement over prior work limited to 8-bit audio. The methodology is well-structured, detailing the compression pipeline, the use of arithmetic coding, and the training of models on diverse audio datasets. The approach is theoretically sound and leverages established principles of autoregressive modeling, making it a meaningful contribution to the field.
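To see why a byte-level schema keeps the vocabulary constant, the sketch below tokenizes signed 24-bit samples into little-endian bytes and inverts the mapping losslessly; the byte order and framing are assumptions for illustration, not necessarily the paper's exact schema:

    def sample_level_vocab_size(bit_depth: int) -> int:
        """Sample-level tokenization needs one token id per possible amplitude value."""
        return 2 ** bit_depth            # 65,536 for 16-bit; 16,777,216 for 24-bit

    def bytes_tokenize(samples, bit_depth=24):
        """Byte-level tokenization: split each signed sample into bit_depth // 8 bytes,
        so the vocabulary stays at 256 ids regardless of bit depth."""
        n_bytes = bit_depth // 8
        tokens = []
        for s in samples:
            u = s & ((1 << bit_depth) - 1)                               # two's-complement wrap
            tokens.extend((u >> (8 * i)) & 0xFF for i in range(n_bytes))  # little-endian bytes
        return tokens

    def bytes_detokenize(tokens, bit_depth=24):
        """Lossless inverse of bytes_tokenize."""
        n_bytes = bit_depth // 8
        samples = []
        for i in range(0, len(tokens), n_bytes):
            u = sum(tokens[i + j] << (8 * j) for j in range(n_bytes))
            samples.append(u - (1 << bit_depth) if u >= (1 << (bit_depth - 1)) else u)
        return samples

    audio = [-8_388_608, -1, 0, 1, 8_388_607]        # 24-bit PCM extremes and small values
    toks = bytes_tokenize(audio)
    assert bytes_detokenize(toks) == audio            # round-trip is lossless
    print(sample_level_vocab_size(24), "vs", 256, "token ids;", len(toks), "byte tokens")

The trade-off is visible in the example: the 256-entry vocabulary is constant in the bit depth, but each sample now costs three autoregressive positions instead of one.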
The authors conduct a comprehensive benchmarking of their proposed method across various audio domains (music, speech, bioacoustics) and bit depths (8, 16, 24-bit). The experiments are rigorous, with comparisons to industry-standard codecs like FLAC, and they provide detailed results that highlight the performance of Trilobyte in different scenarios. The evaluation demonstrates that while the compression gains are modest at higher bit depths, the method consistently outperforms FLAC at 8-bit and shows competitive results at 16-bit.
The authors provide a GitHub repository for the Trilobyte implementation, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and training conditions, to facilitate replication of results by other researchers.
The paper acknowledges that the computational cost of the proposed ML approaches is significantly higher than traditional codecs like FLAC, which may limit their practical deployment in real-world scenarios. Additionally, the modest compression gains at higher bit depths suggest that further optimization is needed to make these methods more competitive.
The work has significant implications for the field of audio compression, particularly in contexts where lossless audio fidelity is critical, such as professional audio production and archival storage. By demonstrating the potential of LMs for lossless audio compression, this research opens avenues for future exploration of machine learning techniques in audio processing.