Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradigm that bypasses the acoustic stage by decoding linguistic intent directly from the neuro-muscular-articulatory continuum. This review provides a high-level synthesis of the SSI landscape, transitioning from traditional transducer-centric analysis to a holistic intent-to-execution taxonomy. We systematically evaluate sensing modalities across four critical physiological interception points: neural oscillations, neuromuscular activation, articulatory kinematics (ultrasound/magnetometry), and pervasive active probing via acoustic or radio-frequency sensing. Critically, we analyze the current paradigm shift from heuristic signal processing to Latent Semantic Alignment. In this new era, Large Language Models (LLMs) and deep generative architectures serve as high-level linguistic priors to resolve the ``informational sparsity'' and non-stationarity of biosignals. By mapping fragmented physiological gestures into structured semantic latent spaces, modern SSI frameworks have, for the first time, approached the Word Error Rate usability threshold required for real-world deployment. We further examine the transition of SSIs from bulky laboratory instrumentation to ``invisible interfaces'' integrated into commodity-grade wearables, such as earables and smart glasses. Finally, we outline a strategic roadmap addressing the ``user-dependency paradox'' through self-supervised foundation models and define the ethical boundaries of ``neuro-security'' to protect cognitive liberty in an increasingly interfaced world.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology, Hunan Normal University, Hunan University
The paper provides a comprehensive synthesis of Silent Speech Interfaces, detailing their evolution and the integration of advanced machine learning techniques. It significantly contributes to the understanding of non-acoustic speech recognition and its potential applications in various domains, marking a notable advancement in the field of human-computer interaction.
The paper presents a comprehensive taxonomy of Silent Speech Interfaces (SSIs) and evaluates various sensing modalities that capture speech intent from physiological signals. The methodology transitions from traditional acoustic-based systems to a focus on non-acoustic modalities, integrating deep learning techniques and Large Language Models (LLMs) to enhance the decoding of silent speech. The authors provide a rigorous classification of SSIs based on their interception points along the neuro-muscular-articulatory continuum, which is a significant advancement in the field. Furthermore, the analysis of algorithmic evolution from heuristic methods to end-to-end neural architectures is well-articulated, showcasing the transition to modern computational frameworks.
While the paper is primarily a systematic review, it synthesizes existing experimental results from various studies, providing a comparative analysis of performance metrics across different SSI modalities. The benchmarks presented, including Word Error Rates (WER) and accuracy metrics, demonstrate the progress made in the field. However, the paper lacks original experimental data or new empirical results, which would have strengthened its contributions.
The paper does not provide specific implementation details or datasets for reproduction, which is a limitation in terms of reproducibility. However, it does mention the importance of open science and benchmarking, indicating a commitment to facilitating reproducible research in the field.
One limitation of the paper is its reliance on existing literature without presenting new experimental findings. Additionally, the ethical considerations surrounding the deployment of SSIs, particularly regarding privacy and cognitive liberty, are discussed but could benefit from a more in-depth exploration of practical implications.
The implications of this research are significant, as SSIs have the potential to revolutionize human-computer interaction, especially for individuals with speech impairments or in environments where vocal communication is impractical. The integration of LLMs into SSI frameworks could lead to more intuitive and accessible communication technologies, impacting assistive technologies and privacy-preserving communication.
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
Primary: Imperial College London
All Institutions: Imperial College London, TUM University Hospital, Munich Center for Machine Learning
The main contribution of this paper is the development of EmoSURA, a framework that enhances the evaluation of emotional speech captions by focusing on atomic verification rather than holistic scoring. This approach represents a significant advancement in the field, addressing critical challenges in evaluating emotional speech and providing a pathway for future research in this area.
The paper introduces EmoSURA, an innovative evaluation framework that decomposes emotional speech captions into Atomic Perceptual Units (APUs). This approach is significant as it moves away from traditional holistic scoring methods, which often fail to capture the nuances of emotional speech. The methodology is well-structured, employing an audio-grounded verification mechanism that enhances the reliability of the evaluation process. The decomposition into APUs allows for a more granular analysis of emotional attributes, which is a novel contribution to the field of speech captioning.
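The atomic-verification idea can be summarized as a small scoring loop: decompose a caption into self-contained statements, check each against the audio, and aggregate. The sketch below is a minimal illustration with hypothetical, injected components (decompose, verify); it is not the actual EmoSURA implementation.

    # Minimal sketch of atomic-verification-style caption scoring, assuming two
    # hypothetical components: a text decomposer and an audio-grounded verifier.
    from typing import Callable, List

    def score_caption(caption: str,
                      audio_path: str,
                      decompose: Callable[[str], List[str]],
                      verify: Callable[[str, str], bool]) -> float:
        """Decompose a long caption into atomic perceptual units, verify each
        unit against the raw audio, and return the fraction that is supported."""
        units = decompose(caption)            # e.g. "the speaker sounds anxious"
        if not units:
            return 0.0
        supported = sum(verify(unit, audio_path) for unit in units)
        return supported / len(units)

    # Toy usage with stubs standing in for an LLM decomposer and an
    # audio-language-model verifier.
    decompose = lambda c: [s.strip() for s in c.split(";") if s.strip()]
    verify = lambda unit, audio: "calm" in unit      # placeholder heuristic
    print(score_caption("the voice is calm; pitch rises sharply", "x.wav",
                        decompose, verify))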
The experiments conducted demonstrate a positive correlation between EmoSURA's assessments and human judgments, which is a critical validation of the framework's effectiveness. The introduction of SURABench as a benchmark for evaluating the proposed method adds to the robustness of the experimental design. However, the paper could benefit from a more detailed description of the datasets used and the specific metrics employed in the evaluation process.
The paper lacks sufficient details regarding the implementation of EmoSURA, including code availability and specific configurations used in the experiments. This omission raises concerns about reproducibility, as other researchers may find it challenging to replicate the results without access to the underlying code or datasets.
One limitation noted is the reliance on human judgments for validation, which, while valuable, may introduce subjectivity into the evaluation process. Additionally, the framework's performance on diverse emotional speech contexts outside the training set remains to be thoroughly assessed.
EmoSURA has the potential to significantly advance the field of emotional speech processing by providing a more nuanced evaluation framework. This could lead to improvements in applications such as affective computing, human-computer interaction, and accessibility tools for individuals with communication difficulties. The implications of this work could extend to various domains, including mental health assessment and entertainment, where understanding emotional nuances in speech is crucial.
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12\% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST)
The main contribution of this work is the introduction of the RAF framework, which enhances GAN-based vocoders' fidelity and generalization capabilities through innovative training objectives that leverage self-supervised learning. This research represents a meaningful step forward in the field of neural vocoding, addressing critical challenges while paving the way for future explorations in efficient and ethical audio synthesis technologies.
The proposed Relativistic Adversarial Feedback (RAF) framework innovatively integrates self-supervised learning (SSL) models into the training of GAN-based vocoders, enhancing their ability to generalize to unseen scenarios while maintaining fidelity. The methodology is well-structured, with clear definitions of the quality and discriminator gaps, and the use of relativistic pairing is a significant advancement over traditional GAN approaches. The incorporation of SSL models as perceptual guidance for the discriminator is a novel approach that addresses the limitations of existing GAN training objectives.
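For readers unfamiliar with relativistic pairing, the generic idea is that the discriminator scores each real waveform relative to its paired fake rather than in isolation. The sketch below shows a standard relativistic logistic loss for paired samples; the exact RAF objective, including its SSL-based perceptual guidance, is not reproduced here.

    # Generic illustration of relativistic pairing for a GAN discriminator,
    # written with a logistic loss over paired real/fake scores.
    import torch
    import torch.nn.functional as F

    def relativistic_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
        """Train the discriminator so real waveforms score higher than their
        paired fakes (softplus(-x) equals -log(sigmoid(x)))."""
        return F.softplus(-(d_real - d_fake)).mean()

    def relativistic_g_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
        """Generator pushes fake scores above the paired real scores."""
        return F.softplus(-(d_fake - d_real)).mean()

    # Toy usage with random discriminator logits for a batch of paired samples.
    d_real, d_fake = torch.randn(8), torch.randn(8)
    print(relativistic_d_loss(d_real, d_fake).item(),
          relativistic_g_loss(d_real, d_fake).item())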
The experiments conducted are comprehensive, utilizing multiple datasets and GAN architectures, which demonstrate the robustness and versatility of the RAF framework. The results indicate consistent improvements in both objective and subjective metrics, showcasing the effectiveness of the proposed method across various scenarios. The comparative studies against baseline methods further validate the advantages of RAF, although the paper could benefit from additional statistical analysis to strengthen claims of significance.
The authors provide a project URL with source code, which is essential for reproducibility. However, the paper lacks detailed hyperparameter settings and training configurations, which could hinder the ability of other researchers to replicate the results fully. Including more specific training details and configurations would enhance reproducibility.
The paper acknowledges the high computational costs associated with training RAF due to the use of long segments and heavy SSL models. Additionally, while the framework shows promise, the authors do not provide a rigorous theoretical foundation for the convergence of RAF, which could be a potential area for future work. Ethical considerations regarding the potential misuse of generated audio deepfakes are also mentioned but not deeply explored.
The advancements in speech synthesis through RAF have significant implications for applications in text-to-speech systems, voice conversion, and potentially in areas like accessibility technology. However, the ethical concerns surrounding the generation of realistic audio deepfakes necessitate careful consideration and the development of countermeasures to prevent misuse.
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language-Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching-based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model, \textbf{Resonate}, establishes a new SOTA on TTA-Bench in terms of both audio quality and semantic alignment.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, SJTU Paris Elite Institute of Technology, X-LANCE Lab
The paper presents a novel integration of online reinforcement learning into text-to-audio generation, achieving state-of-the-art performance and addressing key limitations of previous methods. The technical contributions, particularly in methodology and experimental validation, significantly advance the field of audio generation.
The paper introduces a novel approach to text-to-audio generation by integrating online reinforcement learning (GRPO) with flow-matching audio models. This is a significant methodological advancement, as it addresses the limitations of offline reinforcement learning techniques that have been predominantly used in the field. The adaptation of GRPO for audio generation, along with the use of Large Audio Language Models (LALMs) for reward modeling, presents a fresh perspective on aligning audio generation with human perceptual standards. The methodology is well-structured, with clear definitions and a logical flow from problem identification to proposed solutions.
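The core of GRPO is its group-relative advantage: several samples are generated per prompt, scored by the reward model, and normalized within the group. The sketch below illustrates only that normalization step; the flow-matching policy update and the LALM-based reward models are not reproduced.

    # Minimal sketch of the group-relative advantage computation used by GRPO.
    import numpy as np

    def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        """rewards: shape (num_prompts, group_size), one reward per sample.
        Returns advantages centered and scaled within each prompt group."""
        mean = rewards.mean(axis=1, keepdims=True)
        std = rewards.std(axis=1, keepdims=True)
        return (rewards - mean) / (std + eps)

    # Toy usage: 2 prompts, 4 audio samples each, scored by some reward model
    # (a CLAP- or LALM-based scorer in the paper's setting).
    rewards = np.array([[0.2, 0.5, 0.9, 0.4],
                        [0.7, 0.1, 0.3, 0.6]])
    print(group_relative_advantages(rewards))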
The experiments are robust, utilizing a comprehensive dataset of 3.7 million audio-text pairs and a well-defined evaluation framework (TTA-Bench). The results demonstrate significant improvements in both audio quality and semantic alignment, establishing state-of-the-art performance. The use of both objective and subjective evaluation metrics strengthens the findings, although the paper could benefit from more extensive comparisons with a broader range of existing models.
The authors provide sufficient details regarding the model architecture, training procedures, and evaluation metrics, which aids reproducibility. The availability of code and model weights on GitHub further enhances the potential for other researchers to replicate the study. However, the paper could include more detailed hyperparameter settings and training configurations to facilitate a complete reproduction of the results.
While the paper presents a compelling advancement in TTA generation, it does not thoroughly address potential limitations, such as the scalability of the proposed method to larger datasets or more complex audio generation tasks. Additionally, the reliance on LALMs for reward modeling may introduce biases based on the training data of these models, which could affect the generalizability of the results.
The implications of this research are significant, particularly in fields such as gaming, filmmaking, and virtual reality, where high-fidelity audio generation is crucial. The integration of reinforcement learning in audio generation could pave the way for more interactive and responsive audio systems, enhancing user experiences in various applications. The open-sourcing of the model and code also promotes further research and development in this area.
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have recently gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex-valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block-matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real-valued baselines, and that its block-matrix scheme reduces training time by 25%. Audio samples and code are available at https://hs-oh-prml.github.io/ComVo/.
Primary: Korea University
All Institutions: Korea University, Institute of Information & Communications Technology Planning & Evaluation (IITP)
The main contribution of this paper is the introduction of ComVo, a complex-valued neural vocoder that enhances waveform generation by effectively modeling the interactions between real and imaginary components of spectrograms, leading to improved synthesis quality and efficiency. The comprehensive methodology and experimental validation position this work as a significant advancement in the field of audio processing and neural vocoders.
The paper introduces ComVo, a novel complex-valued neural vocoder that leverages complex-valued neural networks (CVNNs) for waveform generation using an iSTFT-based approach. The methodology is well-structured, employing a GAN framework that operates entirely in the complex domain, which allows for more effective modeling of the inherent relationships between the real and imaginary components of spectrograms. The introduction of phase quantization as a structured nonlinearity and the block-matrix computation scheme for efficiency are notable innovations that enhance both the training process and synthesis quality.
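A block-matrix view of complex arithmetic underlies schemes like the one described above: a complex linear map W = A + iB applied to x = u + iv can be computed as a single real multiply by [[A, -B], [B, A]]. The sketch below verifies this equivalence; it is an illustration of the general idea, not ComVo's actual computation scheme.

    # Complex-valued linear map evaluated via an equivalent real block matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))   # W = A + iB
    u, v = rng.standard_normal(3), rng.standard_normal(3)             # x = u + iv

    # Native complex arithmetic.
    y_complex = (A + 1j * B) @ (u + 1j * v)

    # Equivalent real block-matrix form: [[A, -B], [B, A]] @ [u; v].
    block = np.block([[A, -B], [B, A]])
    y_block = block @ np.concatenate([u, v])
    y_real, y_imag = y_block[:3], y_block[3:]

    assert np.allclose(y_complex.real, y_real) and np.allclose(y_complex.imag, y_imag)
    print(y_real, y_imag)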
The experiments are comprehensive, comparing ComVo against several established vocoders using both subjective and objective metrics. The results demonstrate a clear advantage in synthesis quality, with ComVo achieving the highest scores across various metrics. The use of diverse datasets, including LibriTTS and MUSDB18-HQ, adds robustness to the evaluation. The paper also includes qualitative assessments through visualizations, which further substantiate the claims regarding the effectiveness of the proposed methods.
The paper provides sufficient details regarding the experimental setup, including model architecture, training parameters, and evaluation metrics. The availability of audio samples and code enhances reproducibility, allowing other researchers to validate the findings and build upon the work. However, the complexity of the implementation and the specific configurations used may require additional clarifications for full reproducibility.
While the paper presents significant advancements, it acknowledges limitations such as the high computational overhead associated with complex-valued parameters and potential numerical issues in multi-GPU setups. The reliance on split designs for loss functions and activations may also restrict the exploration of more advanced architectures in future work.
The implications of this research extend to various applications in speech synthesis and audio processing, where improved waveform generation can enhance the quality of synthetic speech and music. The integration of complex-valued networks into vocoders represents a significant step forward in audio generation technologies, potentially influencing future developments in the field.
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we achieve 3.24% WER on Common Voice Swahili, an 82% relative improvement over the baseline. This result surpasses the best previously reported academic system (8.3% WER from XLS-R), a 61% relative improvement. We provide concrete data requirements and a replicable methodology applicable to other low-resource languages.
Primary: Harvard University
All Institutions: Harvard University, Thiomi-Lugha NLP
The paper presents a systematic evaluation of continued pretraining for Swahili ASR, achieving state-of-the-art performance with minimal labeled data. The innovative methodology and significant results provide a valuable framework for advancing ASR technology in low-resource languages, highlighting the potential for broader applications in underserved linguistic communities.
The methodology presented in the paper is well-structured and innovative, focusing on continued pretraining (CPT) for low-resource Swahili ASR. The authors systematically explore the impact of combining unlabeled audio with limited labeled data, employing a pseudo-labeling approach to enhance model performance. The clear three-stage training pipeline (labeling model, continued pretraining, and supervised finetuning) is a practical contribution that can be replicated in other low-resource language contexts. The use of a strong baseline model and conservative hyperparameter tuning are commendable practices that enhance the robustness of the methodology.
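The three-stage pipeline can be expressed as a short recipe: train a labeling model on the small labeled set, pseudo-label the unlabeled audio for continued pretraining, then finetune on the clean labels. The sketch below uses injected stub callables for the training and transcription steps; it is a schematic of the recipe, not the paper's wav2vec2-bert-2.0 code.

    # High-level sketch of the three-stage CPT recipe with injected helpers.
    from typing import Callable, List, Tuple

    def three_stage_cpt(labeled: List[Tuple[str, str]],
                        unlabeled: List[str],
                        base_ckpt: str,
                        finetune: Callable,      # (ckpt, pairs) -> ckpt
                        pretrain: Callable,      # (ckpt, pairs) -> ckpt
                        transcribe: Callable):   # (ckpt, wav) -> text
        labeler = finetune(base_ckpt, labeled)                         # Stage 1
        pseudo = [(wav, transcribe(labeler, wav)) for wav in unlabeled]
        cpt_ckpt = pretrain(base_ckpt, labeled + pseudo)               # Stage 2
        return finetune(cpt_ckpt, labeled)                             # Stage 3

    # Toy usage with no-op stubs standing in for real training runs.
    noop_train = lambda ckpt, pairs: ckpt + "+trained"
    noop_asr = lambda ckpt, wav: "pseudo transcript"
    print(three_stage_cpt([("a.wav", "habari")], ["b.wav"], "w2v-bert",
                          noop_train, noop_train, noop_asr))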
The experimental evaluation is thorough, comparing models trained with and without CPT across different labeled data scales (5K and 20K samples). The results demonstrate a significant improvement in word error rate (WER), achieving state-of-the-art performance for Swahili ASR. The comparative analysis with a baseline model trained on a larger dataset (50K samples) provides a strong basis for the claims made. The use of the Common Voice dataset adds credibility, as it is a widely recognized resource in the ASR community.
The paper provides sufficient detail regarding the experimental setup, including data sources, model architecture, and training procedures, which supports reproducibility. However, the absence of a publicly accessible code repository or demo limits the practical reproducibility of the findings. Clear documentation of hyperparameters and training configurations aids in understanding the methodology.
One limitation is the reliance on pseudo-labeling, which can introduce noise if the baseline model's performance is not sufficiently high. The study also does not explore the impact of varying the quality of unlabeled data on the results, which could provide further insights into the robustness of the approach. Additionally, while the focus on Swahili is valuable, the generalizability of the findings to other low-resource languages remains to be tested.
The research has significant implications for the development of ASR technologies in low-resource languages, particularly for Swahili, which has over 100 million speakers. By demonstrating that high-quality ASR can be achieved with minimal labeled data, the findings can facilitate the creation of educational tools, accessibility applications, and voice interfaces that serve underrepresented language communities. This work contributes to the broader goal of making technology more inclusive and accessible.
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
Primary: Nanjing University
All Institutions: Nanjing University, The Chinese University of Hong Kong, Xiamen University, Shenzhen Loop Area Institute
The main contribution of this paper is the introduction of AlphaFlowTSE, a one-step generative model for target speaker extraction that effectively reduces latency while maintaining high fidelity and generalization performance. This work represents a meaningful advancement in the field of audio processing, particularly in applications requiring real-time speaker extraction from mixed audio environments.
The paper presents AlphaFlowTSE, a novel one-step generative model for target speaker extraction (TSE) that leverages a Jacobian-vector product (JVP)-free AlphaFlow objective. The methodology is innovative in its approach to learning mean-velocity transport along a mixture-to-target trajectory, effectively addressing the challenges of multi-step sampling and mixture-dependent time coordinates. The combination of flow matching with an interval-consistency teacher-student framework enhances training stability and aligns the training process with inference, which is a significant advancement in the field.
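To make the mixture-to-target trajectory concrete, the sketch below shows plain flow matching along a straight path from mixture to target, where the model regresses the constant velocity (target - mixture). This conveys only the generic idea; the JVP-free AlphaFlow mean-velocity objective and the interval-consistency teacher-student target are not reproduced.

    # Toy flow-matching loss along a mixture-to-target path.
    import torch

    def mixture_to_target_fm_loss(model, mixture: torch.Tensor,
                                  target: torch.Tensor) -> torch.Tensor:
        t = torch.rand(mixture.shape[0], 1)              # one time per example
        x_t = (1 - t) * mixture + t * target             # point on the path
        v_true = target - mixture                        # straight-line velocity
        v_pred = model(x_t, t)
        return ((v_pred - v_true) ** 2).mean()

    # Toy usage with a linear "model" over 16-sample waveforms.
    net = torch.nn.Linear(17, 16)                        # input: x_t concat t
    model = lambda x, t: net(torch.cat([x, t], dim=-1))
    mix, tgt = torch.randn(4, 16), torch.randn(4, 16)
    print(mixture_to_target_fm_loss(model, mix, tgt).item())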
The experimental setup is robust, utilizing well-established datasets such as Libri2Mix and REAL-T to validate the model's performance. The results demonstrate that AlphaFlowTSE achieves superior target-speaker similarity and generalization capabilities in downstream automatic speech recognition (ASR) tasks compared to existing methods. The paper provides comprehensive quantitative metrics, including PESQ, ESTOI, and SI-SDR, which support the claims of improved performance.
The paper lacks a dedicated section on code availability or implementation details, which raises concerns about reproducibility. While the methodology is described in detail, the absence of a project URL or demo limits the ability of other researchers to replicate the results independently.
One limitation is the reliance on synthetic datasets for training, which may not fully capture the complexities of real-world scenarios. Additionally, the model's performance in extremely noisy environments or with overlapping speech from multiple speakers remains to be thoroughly evaluated.
The advancements made in this paper have the potential to significantly improve applications in real-time speech processing, such as virtual assistants, conference call technologies, and hearing aids, where accurate speaker extraction is crucial. The low-latency inference capability of AlphaFlowTSE could enhance user experiences in interactive settings.
We propose self-speculative decoding for speech-aware LLMs by using the CTC encoder as a draft model to accelerate auto-regressive (AR) inference and improve ASR accuracy. Our three-step procedure works as follows: (1) if the frame entropies of the CTC output distributions are below a threshold, the greedy CTC hypothesis is accepted as final; (2) otherwise, the CTC hypothesis is verified in a single LLM forward pass using a relaxed acceptance criterion based on token likelihoods; (3) if verification fails, AR decoding resumes from the accepted CTC prefix. Experiments on nine corpora and five languages show that this approach can simultaneously accelerate decoding and reduce WER. On the HuggingFace Open ASR benchmark with a 1B parameter LLM and 440M parameter CTC encoder, we achieve a record 5.58% WER and improve the inverse real time factor by a factor of 4.4 with only a 12% relative WER increase over AR search. Code and model weights are publicly available under a permissive license.
Primary: IBM Research
All Institutions: IBM Research
The main contribution of this paper is the introduction of self-speculative decoding, which leverages CTC encoders to enhance the efficiency and accuracy of ASR systems using LLMs. This work represents a meaningful advancement in the field of ASR, combining innovative methodology with rigorous experimental validation to address key challenges in the domain.
The proposed self-speculative decoding method is innovative in its use of a CTC encoder as a draft model to enhance the efficiency of auto-regressive inference in ASR systems. The three-step procedure effectively combines the strengths of CTC and LLMs, allowing for a more efficient decoding process while maintaining accuracy. The method is well-structured, with clear delineation of the verification and fallback processes, which showcases a thoughtful integration of existing techniques with novel adaptations. However, the reliance on a specific architecture (CTC-trained SLM) may limit its applicability to other frameworks.
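The accept/verify/fallback control flow can be sketched compactly. In the example below the CTC decoder, LLM scorer, and AR decoder are injected as callables, and the thresholds and helper signatures are illustrative only; this is not the released implementation.

    # Schematic of the three-step self-speculative decoding logic.
    from typing import Callable, List

    def self_speculative_decode(audio,
                                ctc_decode: Callable,      # audio -> (tokens, frame_entropies)
                                llm_token_probs: Callable, # (audio, tokens) -> per-token probs
                                ar_decode: Callable,       # (audio, prefix) -> tokens
                                ent_thresh: float = 0.5,
                                prob_thresh: float = 0.1) -> List[int]:
        tokens, entropies = ctc_decode(audio)
        # Step 1: confident CTC frames -> accept the greedy hypothesis outright.
        if max(entropies) < ent_thresh:
            return tokens
        # Step 2: single LLM forward pass; keep the longest prefix whose tokens
        # all clear a relaxed likelihood threshold.
        probs = llm_token_probs(audio, tokens)
        accepted = 0
        while accepted < len(tokens) and probs[accepted] >= prob_thresh:
            accepted += 1
        if accepted == len(tokens):
            return tokens
        # Step 3: fall back to auto-regressive decoding from the accepted prefix.
        return ar_decode(audio, tokens[:accepted])

    # Toy usage with stubs: high entropy forces verification, which rejects token 2.
    stub_ctc = lambda a: ([1, 2, 3], [0.2, 0.9, 0.3])
    stub_llm = lambda a, toks: [0.8, 0.05, 0.7]
    stub_ar = lambda a, prefix: prefix + [9, 9]
    print(self_speculative_decode("x.wav", stub_ctc, stub_llm, stub_ar))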
The experiments conducted on nine corpora across five languages provide a robust evaluation of the method's performance. Achieving a record WER of 5.58% and a significant improvement in the inverse real-time factor (RTF) demonstrates the practical benefits of the proposed approach. The paper includes a detailed analysis of the results, including ablation studies that validate the necessity of both verification stages, which adds credibility to the findings. However, the lack of comparison with a broader range of existing methods could be seen as a limitation.
The paper mentions that code and model weights are publicly available, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and specific configurations used during training and evaluation, to facilitate easier replication by other researchers.
The method has several limitations, including its dependency on a specific SLM architecture with a frozen CTC encoder, which may not generalize well to other models or tasks outside of ASR. Additionally, the utterance-based verification process could lead to inefficiencies in scenarios with low acceptance rates, as the entire utterance must be re-decoded from the point of failure. The paper also acknowledges the potential for language model bias, which could affect the accuracy of the final outputs.
The proposed method has significant implications for real-time ASR applications, particularly in environments where both speed and accuracy are critical. By improving the efficiency of ASR systems, this research could enhance user experiences in various domains, including virtual assistants, transcription services, and accessibility technologies. The approach also opens avenues for further research into joint training strategies for encoders and LLMs, potentially leading to advancements in related speech tasks.
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio (V2A) generation models, with key adaptations that align generated audio with human preferences. Our approach comprises three core innovations: (1) AudioScore, a comprehensive human preference-aligned scoring system for assessing semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference pair data for DPO optimization; and (3) a curriculum learning-empowered DPO optimization strategy specifically tailored for flow-based generative models. Experiments on the benchmark VGGSound dataset demonstrate that human-preference-aligned Frieren and MMAudio models trained with V2A-DPO outperform their counterparts optimized using Denoising Diffusion Policy Optimization (DDPO) as well as pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, The Chinese University of Hong Kong, The University of Warwick, National Research Council Canada, Hong Kong SAR
The paper introduces V2A-DPO, a novel framework for aligning audio generation with human preferences in video-to-audio tasks, significantly advancing the state-of-the-art in this domain. The comprehensive evaluation of the proposed methodology and its implications for the field highlight its potential impact on future research and applications in multimedia generation.
The methodology presented in this paper introduces a novel framework, V2A-DPO, which effectively integrates human preference alignment into the video-to-audio generation process. The three core innovations—AudioScore, an automated preference pair data generation pipeline, and a curriculum learning-empowered DPO optimization strategy—are well-articulated and address significant limitations in existing V2A models. The use of a comprehensive scoring system to evaluate multiple dimensions of audio quality is particularly noteworthy, as it enhances the robustness of the evaluation process. However, the complexity of the model and the reliance on human-annotated data may pose challenges in terms of scalability and generalizability.
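For context, the sketch below shows the generic DPO preference loss over (preferred, rejected) pairs in its original log-likelihood form; the paper's flow-matching adaptation, AudioScore reward pipeline, and curriculum schedule are not reproduced.

    # Generic DPO preference loss over (preferred, rejected) pairs.
    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        """logp_*: policy log-likelihoods of the preferred (w) / rejected (l)
        samples; ref_logp_*: the same quantities under the frozen reference."""
        margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        return -F.logsigmoid(beta * margin).mean()

    # Toy usage with random log-likelihoods for a batch of preference pairs
    # (pairs would come from the AudioScore-driven pipeline in the paper).
    lw, ll, rw, rl = (torch.randn(8) for _ in range(4))
    print(dpo_loss(lw, ll, rw, rl).item())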
The experimental evaluation is thorough, utilizing the VGGSound dataset to benchmark the performance of the proposed method against state-of-the-art models. The results indicate that the V2A-DPO framework significantly improves audio generation quality, semantic alignment, and temporal coherence. The paper provides clear comparisons with baseline models and other contemporary approaches, showcasing the effectiveness of the proposed method. However, further exploration of diverse datasets and real-world scenarios could strengthen the findings.
The paper includes detailed implementation specifics, such as model parameters, training protocols, and evaluation metrics, which enhance reproducibility. The use of well-defined datasets and the provision of a demo URL further support the reproducibility of the results. However, the complexity of the AudioScore system and the curriculum learning approach may require additional documentation for full replication.
One limitation of the study is the dependency on human-annotated data for training the AudioScore, which may not be feasible for larger-scale applications. Additionally, the performance metrics primarily focus on quantitative assessments, which may overlook qualitative aspects of audio generation. The model's performance in more diverse and challenging scenarios remains to be fully explored.
The V2A-DPO framework has significant implications for multimedia applications, particularly in enhancing the quality of audio generation in video content. This could benefit various industries, including entertainment, education, and accessibility technologies. The integration of human preferences into generative models represents a step towards more user-centered AI applications, potentially leading to more engaging and immersive experiences.
Large language models are increasingly adopted as semantic backbones for neural text-to-speech (TTS) systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-finetuned base Qwen-0.5B model along three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data: speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, the LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency when served as a quantized GGUF model.
Primary: Sprinklr AI
All Institutions: Sprinklr AI
The paper establishes that LoRA fine-tuning is a powerful mechanism for enhancing speaker-level adaptation in LLM-based TTS systems, emphasizing the critical role of training data diversity in achieving high-quality voice synthesis. The comprehensive analysis of methodology, experiments, and results positions this work as a valuable contribution to the field of machine learning and speech synthesis.
The paper employs a systematic approach to fine-tuning the Qwen-0.5B language model backbone for TTS applications using Low-Rank Adaptation (LoRA). The methodology is robust, utilizing both full fine-tuning and LoRA fine-tuning techniques, and includes a comprehensive analysis of the impact of training data variability on model performance. The identification of the loss-quality divergence phenomenon is a significant methodological contribution that challenges conventional practices in model evaluation. The experiments are well-structured, focusing on perceptual quality metrics such as DNS-MOS and SNR, which are critical for TTS applications.
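As a reminder of the mechanism being evaluated, the sketch below shows a LoRA-adapted linear layer: the frozen base weight is augmented with a low-rank update B @ A, and only A and B are trained. The rank and scaling values are illustrative, not the paper's settings.

    # Minimal LoRA-adapted linear layer.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():      # freeze the pretrained weight
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # Toy usage: wrap one projection of a (hypothetical) LLM backbone layer.
    layer = LoRALinear(nn.Linear(512, 512))
    print(layer(torch.randn(2, 512)).shape)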
The experiments are extensive, involving multiple datasets and speakers, which provide a thorough evaluation of the proposed methods. The results demonstrate clear improvements in voice consistency and quality metrics across different conditions, particularly highlighting the importance of data diversity. The paper presents quantitative results that substantiate the claims made regarding the effectiveness of LoRA fine-tuning, with detailed tables and figures that illustrate the performance gains achieved. However, the paper could benefit from a more explicit discussion of the statistical significance of the results presented.
The paper includes sufficient details about the experimental setup, including the datasets used, training parameters, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository or dataset limits the ability for others to fully replicate the findings. Providing access to the datasets and code would significantly improve the reproducibility of the results.
The paper acknowledges the limitations related to the quality of training data, particularly the impact of low variability in acoustic energy on the fine-tuning outcomes. Additionally, the findings regarding the loss-quality divergence could benefit from further exploration, particularly in terms of how this phenomenon might affect different types of TTS applications. The results are also somewhat speaker-dependent, which may limit generalizability across diverse TTS scenarios.
The findings have significant implications for the development of more efficient and effective TTS systems, particularly in applications requiring high-quality voice cloning. The insights into data diversity and the effectiveness of LoRA fine-tuning could influence future research and development in the field of speech synthesis, making it more accessible and adaptable to various use cases. The potential for multi-speaker models to generalize across unseen speakers is particularly promising for real-world applications.
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks: FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR. FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD. FireRedLID: An Encoder-Decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain. FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%). To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
Primary: Super Intelligence Team
All Institutions: Super Intelligence Team
FireRedASR2S represents a significant advancement in automatic speech recognition systems by integrating multiple essential modules into a cohesive framework. The technical contributions, particularly in the areas of multilingual support and dialect recognition, are well-founded and demonstrate a strong potential for real-world applications, making the system a valuable resource for both research and industry.
The methodology presented in FireRedASR2S is robust, integrating multiple modules (ASR, VAD, LID, Punc) into a single pipeline, which is a significant advancement over traditional systems that rely on disparate components. Each module is well-defined, with clear architectural choices and training strategies that leverage large datasets, particularly for dialect coverage and multilingual support. The use of human-annotated data for VAD is a notable improvement over conventional methods that depend on ASR forced alignments.
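The unified pipeline amounts to chaining the four stages over each detected speech segment. The sketch below is schematic, with the VAD, LID, ASR, and punctuation components injected as callables; the function names and signatures are hypothetical and do not reflect the released FireRedASR2S API.

    # Schematic chaining of VAD -> LID -> ASR -> punctuation over one recording.
    from typing import Callable

    def transcribe_pipeline(audio,
                            vad: Callable,    # audio -> list of (start, end) segments
                            lid: Callable,    # segment -> language tag
                            asr: Callable,    # (segment, language) -> raw text
                            punc: Callable):  # raw text -> punctuated text
        results = []
        for start, end in vad(audio):
            segment = (audio, start, end)
            lang = lid(segment)
            results.append((lang, punc(asr(segment, lang))))
        return results

    # Toy usage with stub components.
    print(transcribe_pipeline("meeting.wav",
                              vad=lambda a: [(0.0, 3.2), (4.1, 7.5)],
                              lid=lambda s: "zh",
                              asr=lambda s, l: "ni hao shi jie",
                              punc=lambda t: t + "."))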
The experimental evaluation is comprehensive, utilizing a variety of public benchmarks to assess the performance of each module independently. The reported results demonstrate state-of-the-art accuracy across multiple tasks, indicating the effectiveness of the proposed system. However, the paper could benefit from more detailed comparisons with additional state-of-the-art systems beyond the mentioned baselines.
The authors have made their model weights and code publicly available, which enhances reproducibility. However, the paper could provide more detailed implementation instructions or configuration settings to facilitate easier replication of results by other researchers.
While the system shows strong performance, it may still face challenges in highly noisy environments or with heavily accented speech that was not well-represented in the training data. Additionally, the reliance on large-scale data may limit accessibility for smaller research teams or institutions.
The integration of multiple speech processing tasks into a single system has significant implications for real-world applications, particularly in multilingual and dialect-rich environments. This could enhance accessibility and usability in various domains, including education, customer service, and content creation.
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
Primary: Central University
All Institutions: Applied AI Institute, Central University, City University of Hong Kong, Trusted AI Research Center
The paper presents PV-VASM, a robust framework for verifying voice anti-spoofing models, addressing critical gaps in existing methodologies and offering a probabilistic approach to enhance security against advanced speech synthesis threats. The comprehensive evaluation of its effectiveness across diverse experimental settings underscores its potential impact on the field of audio machine learning.
The proposed PV-VASM framework introduces a novel probabilistic approach for verifying the robustness of voice anti-spoofing models. It is model-agnostic and capable of estimating misclassification probabilities under various transformations, including text-to-speech and voice cloning. The use of probabilistic concentration inequalities to derive theoretical upper bounds on error probabilities is a significant methodological contribution, as it addresses a gap in existing robustness verification techniques that often lack formal guarantees. The framework is well-structured, with clear definitions and procedures for estimating the necessary statistics, although the complexity of the equations may pose a barrier for some practitioners.
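For intuition on the probabilistic verification step, the sketch below estimates the misclassification probability under sampled transformations and adds a Hoeffding-style upper confidence term; this conveys the flavor of the approach but is not the paper's exact estimator or bound, and the detector and transform interfaces are assumed.

```python
import math, random

def misclassification_upper_bound(detector, bona_fide_audio, transforms,
                                  n=1000, delta=0.05):
    """Estimate the probability that a synthetically transformed utterance is
    accepted as bona fide, plus a (1 - delta) Hoeffding upper confidence term."""
    errors = 0
    for _ in range(n):
        transform = random.choice(transforms)   # e.g. a TTS, VC, or signal transform
        spoofed = transform(bona_fide_audio)
        if not detector.is_spoof(spoofed):      # spoof accepted -> misclassification
            errors += 1
    p_hat = errors / n
    # Hoeffding term: with probability >= 1 - delta, the true error rate
    # does not exceed the returned value.
    return p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```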
The experiments conducted cover a wide range of transformations and generative models, providing a comprehensive evaluation of the PV-VASM framework. The results demonstrate the effectiveness of the proposed method in various settings, particularly in its ability to certify robustness against unseen transformations. The use of diverse datasets, including ASVspoof and others, strengthens the empirical validation. However, the paper could benefit from more extensive comparisons with existing methods to highlight the advantages of PV-VASM more clearly.
The paper provides a detailed description of the experimental setup, including the datasets used, model architectures, and hyperparameters. However, the lack of publicly available code or a project URL limits the reproducibility of the results. Future work should consider releasing the implementation to facilitate validation by the research community.
The main limitations include the potential conservativeness of the upper bounds derived, which may not accurately reflect the true misclassification probabilities in all scenarios. Additionally, the complexity of the methodology may hinder its adoption by practitioners unfamiliar with probabilistic methods. The paper also acknowledges that robustness against generative models is more challenging, indicating that further research is needed to improve performance in this area.
The proposed framework has significant implications for the security of voice recognition systems, particularly in the context of increasing threats from deepfake technologies. By providing a systematic approach to verify the robustness of voice anti-spoofing models, PV-VASM can enhance the reliability of these systems in real-world applications, potentially leading to safer authentication processes in various domains, including finance and personal security.
Although the deep integration of Automatic Speech Recognition (ASR) systems with Large Language Models (LLMs) has significantly improved accuracy, deploying such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unified LLM-based framework that integrates both non-streaming and streaming speech recognition capabilities. We propose a joint training paradigm that enables the system to transition seamlessly between the two recognition modes without any architectural modifications. Furthermore, we introduce a context-aware training paradigm and a co-designed fallback decoding strategy, which enhance streaming recognition accuracy without introducing additional latency. Experimental results demonstrate that Uni-ASR not only achieves competitive performance in non-streaming mode but also remains effective in streaming scenarios under diverse latency constraints.
Primary: Tongyi AI Lab
All Institutions: Tongyi AI Lab
The main contribution of this paper is the introduction of Uni-ASR, a unified LLM-based architecture that effectively integrates non-streaming and streaming ASR capabilities, demonstrating competitive performance across diverse latency constraints. This work represents a meaningful step forward in the development of flexible and efficient ASR systems, with the potential to significantly impact both academic research and practical applications in the field.
The proposed methodology of Uni-ASR is innovative in its integration of both non-streaming and streaming ASR capabilities within a single architecture. The joint training paradigm allows for seamless transitions between modes, which is a significant advancement over existing systems that typically require separate models or complex adaptations. The introduction of a context-aware training paradigm and a fallback decoding strategy enhances the robustness of the streaming recognition process, addressing key challenges such as latency and accuracy without compromising performance. The use of established architectures like Conformer and the pre-trained Qwen3-1.7B model provides a solid foundation for the proposed methods, although the paper could benefit from a more thorough comparison with other recent architectures.
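A minimal sketch of a joint-training loop in this spirit is shown below, assuming each batch is randomly assigned a streaming or non-streaming mode so that a single set of weights serves both; the model interface is an assumption, not the paper's implementation.

```python
import random
import torch

def joint_training_step(model, batch, optimizer, p_streaming=0.5):
    """One optimization step with a randomly sampled recognition mode."""
    streaming = random.random() < p_streaming
    # In streaming mode the encoder would see only causal chunks (e.g. via an
    # attention mask); in non-streaming mode it sees the full utterance.
    logits = model(batch["speech"], streaming=streaming)   # (batch, time, vocab)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), batch["labels"].view(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), streaming
```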
The experimental setup is comprehensive, utilizing multiple well-known ASR benchmarks to evaluate the performance of Uni-ASR. The results indicate that the model achieves competitive performance in both non-streaming and streaming modes, outperforming several state-of-the-art systems. The ablation studies are particularly valuable, as they provide insights into the contributions of different components of the model. However, the focus on a bilingual corpus may limit the generalizability of the findings to other languages or dialects.
The paper provides a detailed description of the training process, data preprocessing, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which other researchers can replicate the results. Including such resources would significantly enhance the paper's impact.
One limitation of the study is the focus on a specific bilingual corpus, which may not fully capture the diversity of speech recognition challenges across different languages and dialects. Additionally, while the fallback decoding strategy is innovative, it may introduce complexities in real-world applications where contextual dependencies are not easily modeled. The paper could also benefit from a discussion on the computational efficiency of the proposed methods, particularly in terms of resource requirements for deployment.
The advancements presented in Uni-ASR have significant implications for real-time applications of ASR, such as live transcription services, voice assistants, and accessibility tools. By addressing the latency issues associated with streaming ASR, this work could enhance user experiences in various domains, including education, healthcare, and customer service. The unified framework also opens avenues for further research into hybrid ASR systems that can adapt to different operational contexts.
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MoXaRt, a novel audio-visual system for real-time sound separation in XR environments, which significantly enhances user experience by improving speech intelligibility and reducing cognitive load. The combination of audio and visual cues represents a meaningful advancement in the field, although further work on reproducibility and dataset diversity is needed to fully realize its potential.
The methodology presented in MoXaRt is innovative, utilizing a cascaded architecture that combines audio-only separation with visual detection to enhance sound interaction in XR environments. The approach of using visual anchors to guide audio source separation is particularly noteworthy, as it leverages multimodal data to improve performance in complex acoustic scenarios. However, the paper could benefit from a more detailed explanation of the architecture's components and the rationale behind the design choices, as well as comparisons to existing methods.
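A schematic of such a cascade is sketched below, with hypothetical component interfaces (coarse separator, visual detector, anchor-conditioned refiner); it illustrates the data flow rather than MoXaRt's actual networks.

```python
def separate_with_visual_anchors(mix_audio, video_frames,
                                 coarse_separator, detector, refiner):
    """Coarse audio-only separation runs alongside visual detection; each visual
    anchor then conditions a refinement pass that isolates one source."""
    coarse_stems = coarse_separator(mix_audio)    # audio-only, source-agnostic stems
    anchors = detector(video_frames)              # e.g. face / instrument detections
    refined = []
    for anchor in anchors:
        # The anchor steers the refiner toward the source tied to that on-screen object.
        refined.append(refiner(mix_audio, coarse_stems, anchor))
    return refined
```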
The experimental evaluation is robust, featuring a new dataset specifically designed for the task, which adds value to the research. The results demonstrate a significant improvement in speech intelligibility and cognitive load reduction, supported by statistical significance. The user study with 22 participants provides practical insights into the system's effectiveness, although a larger sample size could enhance the reliability of the findings.
The paper lacks detailed implementation specifics, such as hyperparameter settings, model architectures, and training procedures, which are crucial for reproducibility. Providing access to code or detailed supplementary materials would greatly enhance the ability of other researchers to replicate the results.
One limitation is the relatively small dataset size (30 recordings), which may not capture the full diversity of real-world acoustic environments. Additionally, the study primarily focuses on speech and music separation; other sound sources or more complex scenarios may not be adequately addressed. The processing latency of ~2 seconds, while acceptable for some applications, may still pose challenges in real-time interactions.
The potential applications of MoXaRt are significant, particularly in enhancing user experiences in XR environments, where sound plays a crucial role in immersion and social interaction. By improving speech intelligibility and reducing cognitive load, this system could facilitate better communication and engagement in virtual settings, with implications for education, gaming, and remote collaboration.
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives.
Primary: Shenzhen Research Institute of Big Data
All Institutions: Shenzhen Research Institute of Big Data, ETH Zürich, Central Media Technology Institute, Nanjing University, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of G-STAR, an innovative end-to-end system for timestamped speaker-attributed ASR that effectively combines speaker tracking with advanced transcription techniques. This work addresses critical gaps in existing methodologies, offering a promising direction for future research in speaker recognition and transcription technologies.
The G-STAR system presents a novel approach to timestamped speaker-attributed ASR by integrating a time-aware speaker-tracking module with a Speech-LLM transcription backbone. This dual-component system allows for both local and global speaker identity tracking, addressing the limitations of existing systems that either focus on local diarization or global labeling. The methodology is well-structured, with a clear explanation of how the tracker provides structured cues and how the LLM generates attributed text based on these cues. The flexibility of supporting both component-wise optimization and joint end-to-end training is a significant strength, allowing for adaptation to various supervision scenarios and domain shifts.
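One plausible way the tracker's output could be serialized into cues for the transcription backbone is sketched below; the cue format and speaker tags are assumptions, not G-STAR's actual schema.

```python
def build_speaker_cue_prompt(tracker_segments):
    """tracker_segments: list of (speaker_id, start_sec, end_sec) from the tracker.
    Returns a structured text prompt that conditions the transcription backbone."""
    lines = [
        f"<spk{spk}> {start:.2f}-{end:.2f}"
        for spk, start, end in sorted(tracker_segments, key=lambda s: s[1])
    ]
    return "Speaker segments:\n" + "\n".join(lines) + "\nTranscribe with speaker labels:"

# Example: build_speaker_cue_prompt([(0, 0.0, 3.2), (1, 2.8, 6.1)])
```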
The experiments conducted in the paper are comprehensive, analyzing multiple aspects such as cue fusion, local versus long-context trade-offs, and hierarchical objectives. However, the paper could benefit from more detailed results, including quantitative metrics that compare G-STAR's performance against baseline models. The absence of a clear benchmark or comparison with state-of-the-art systems in the results section limits the ability to fully assess the effectiveness of the proposed method.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While the methodology is described, there are no clear instructions or links to code repositories or datasets used in the experiments. This absence makes it challenging for other researchers to replicate the study or build upon the findings.
One limitation of the G-STAR system is its reliance on the quality of the underlying Speech-LLM transcription backbone. If the LLM struggles with transcription accuracy, it could adversely affect the overall performance of the system. Additionally, the paper does not address potential challenges in real-world applications, such as varying acoustic environments or speaker accents, which could impact the robustness of the speaker-tracking module.
The implications of this research are significant, as accurate speaker attribution in multi-party conversations is crucial for various applications, including meeting transcriptions, customer service interactions, and accessibility tools for the hearing impaired. By improving the consistency and accuracy of speaker identification in complex audio environments, G-STAR has the potential to enhance communication technologies and facilitate better understanding in multi-speaker scenarios.
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. Our model employs a post-encoder separator with serialized CTC to produce talker-ordered transcripts, and leverages an adapted LLM-based SOT objective as a multi-talker-aware teacher signal to explicitly regularize mixed-speech representations. To further support variable numbers of talkers, we introduce a Talker-Count Head that predicts the talker count and dynamically selects the appropriate decoding branch. Experiments on LibriMix show that the proposed encoder-only model achieves comparable performance to LLM-based systems in the two-talker condition, while delivering significant improvements in the three-talker condition at a substantially smaller real-time factor (RTF).
Primary: SB Intuitions
All Institutions: SB Intuitions
The main contribution of this paper is the development of an encoder-only multi-talker ASR system that distills semantic knowledge from LLMs, achieving competitive performance while maintaining efficient inference. This work represents a meaningful advancement in the field of speech recognition, particularly in handling overlapping speech scenarios.
The proposed methodology introduces a novel encoder-only framework for multi-talker ASR that leverages the semantic guidance from LLMs during training while maintaining a fast CTC-style decoding at inference. The adaptation of LLMs to multi-talker conditioning and the introduction of a Talker-Count Head (TCH) are significant innovations that address the limitations of existing approaches, particularly in handling variable talker counts and improving performance in challenging conditions. The use of serialized CTC for efficient inference is a well-thought-out choice that enhances the practicality of the model.
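A minimal sketch of a talker-count head and branch selection in this spirit follows; the dimensions, pooling choice, and branch structure are illustrative assumptions.

```python
import torch.nn as nn

class TalkerCountHead(nn.Module):
    def __init__(self, enc_dim=512, max_talkers=3):
        super().__init__()
        self.classifier = nn.Linear(enc_dim, max_talkers)

    def forward(self, enc_out):                  # enc_out: (batch, time, enc_dim)
        pooled = enc_out.mean(dim=1)             # utterance-level summary
        return self.classifier(pooled)           # logits over 1..max_talkers

def select_branch(enc_out, head, branches):
    """Pick the serialized-CTC decoding branch matching the predicted count.
    `branches` maps a talker count to a decoding function; assumes batch size 1."""
    count = head(enc_out).argmax(dim=-1).item() + 1    # 1-indexed talker count
    return branches[count](enc_out)
```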
The experiments conducted on the LibriMix dataset are thorough and demonstrate the effectiveness of the proposed model. The results indicate that the encoder-only model achieves comparable performance to LLM-based systems in two-talker scenarios and outperforms them in three-talker conditions, showcasing the robustness of the approach. The evaluation metrics, including word error rates (WER), provide a clear picture of the model's performance across different configurations.
The paper provides sufficient details on the model architecture, training procedures, and datasets used, which aids in reproducibility. However, the absence of a public code repository or demo URL limits the ease with which other researchers can replicate the results.
One notable limitation is the reliance on accurate talker-count estimation, which may not be robust under all conditions, particularly in noisy environments. Additionally, while the model performs well in controlled settings, its generalizability to real-world applications with more than three talkers or varying noise conditions remains uncertain.
The proposed framework has significant implications for real-time multi-talker ASR applications, such as in conference settings, call centers, and assistive technologies. By improving the accuracy and efficiency of ASR systems, this work could enhance communication accessibility and user experience in various domains.
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
Primary: Imperial College London
All Institutions: Imperial College London, TUM University Hospital, Munich Center for Machine Learning
The main contribution of this paper is the development of EmoSURA, a framework that enhances the evaluation of emotional speech captions by focusing on atomic verification rather than holistic scoring. This approach represents a significant advancement in the field, addressing critical challenges in evaluating emotional speech and providing a pathway for future research in this area.
The paper introduces EmoSURA, an innovative evaluation framework that decomposes emotional speech captions into Atomic Perceptual Units (APUs). This approach is significant as it moves away from traditional holistic scoring methods, which often fail to capture the nuances of emotional speech. The methodology is well-structured, employing an audio-grounded verification mechanism that enhances the reliability of the evaluation process. The decomposition into APUs allows for a more granular analysis of emotional attributes, which is a novel contribution to the field of speech captioning.
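The atomic-verification idea can be summarized in a few lines, as sketched below; the splitter and verifier interfaces are hypothetical stand-ins for EmoSURA's components.

```python
def atomic_verification_score(caption, audio, splitter, verifier):
    """Split a caption into self-contained statements, verify each against the
    audio, and return the fraction of statements grounded in the signal."""
    units = splitter(caption)   # e.g. ["the speaker sounds tense", "pitch rises at the end"]
    if not units:
        return 0.0
    verdicts = [verifier(audio, unit) for unit in units]   # each in {0, 1} or [0, 1]
    return sum(verdicts) / len(verdicts)
```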
The experiments conducted demonstrate a positive correlation between EmoSURA's assessments and human judgments, which is a critical validation of the framework's effectiveness. The introduction of SURABench as a benchmark for evaluating the proposed method adds to the robustness of the experimental design. However, the paper could benefit from a more detailed description of the datasets used and the specific metrics employed in the evaluation process.
The paper lacks sufficient details regarding the implementation of EmoSURA, including code availability and specific configurations used in the experiments. This omission raises concerns about reproducibility, as other researchers may find it challenging to replicate the results without access to the underlying code or datasets.
One limitation noted is the reliance on human judgments for validation, which, while valuable, may introduce subjectivity into the evaluation process. Additionally, the framework's performance on diverse emotional speech contexts outside the training set remains to be thoroughly assessed.
EmoSURA has the potential to significantly advance the field of emotional speech processing by providing a more nuanced evaluation framework. This could lead to improvements in applications such as affective computing, human-computer interaction, and accessibility tools for individuals with communication difficulties. The implications of this work could extend to various domains, including mental health assessment and entertainment, where understanding emotional nuances in speech is crucial.
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced to enable large-scale emotion-aware text modeling without requiring manually annotated text corpora. The second part addresses emotion recognition in conversational settings. Hierarchical architectures combining cross-modal attention and mixture-of-experts fusion are developed to integrate acoustic and semantic information across conversational turns. Finally, the thesis introduces a textless and non-parallel speech-to-speech framework for emotion style transfer that enables controllable emotional transformations while preserving speaker identity and linguistic content. The results demonstrate improved emotion transfer and show that style-transferred speech can be used for data augmentation to improve emotion recognition.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science
The main contribution of this paper is the innovative integration of acoustic and semantic modeling for emotion in spoken language, advancing the field of emotion recognition and synthesis. The comprehensive methodology and promising experimental results position this work as a significant step towards creating emotionally aware AI systems, though further details on reproducibility and limitations could strengthen its impact.
The paper introduces a comprehensive approach to emotion modeling in spoken language by integrating acoustic and semantic information. The proposed methods, including emotion-aware representation learning through pre-training and a speech-driven supervised framework, are innovative. The hierarchical architectures for emotion recognition in conversational settings and the textless speech-to-speech framework for emotion style transfer demonstrate a solid understanding of the multimodal nature of emotions. However, the details on the implementation of these methods could be elaborated further to enhance clarity.
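A minimal cross-modal attention block of the kind described is sketched below for intuition; the dimensions and residual fusion are illustrative choices, not the thesis's exact architecture.

```python
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, audio_feats):
        # text_feats: (batch, n_tokens, dim); audio_feats: (batch, n_frames, dim)
        attended, _ = self.attn(query=text_feats, key=audio_feats, value=audio_feats)
        return self.norm(text_feats + attended)   # residual fusion of the two modalities
```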
The experiments utilize well-established datasets such as IEMOCAP and MELD, which are appropriate for the tasks at hand. The results indicate improved performance in emotion recognition and transfer, showcasing the effectiveness of the proposed models. However, the paper would benefit from a more detailed comparison with state-of-the-art methods to contextualize its contributions better.
While the paper claims to provide a robust framework, the reproducibility of the results is somewhat hampered by the lack of detailed implementation specifics, such as hyperparameter settings and training procedures. Including a supplementary material or a GitHub repository would significantly enhance reproducibility.
The paper does not address potential limitations in the datasets used, such as biases in emotion labeling or the generalizability of the models to different languages or cultures. Additionally, the reliance on large-scale pre-training may pose challenges in terms of computational resources and accessibility.
The work has significant implications for the development of emotionally intelligent AI systems, which can enhance human-computer interaction in various applications, including virtual assistants, therapy bots, and entertainment. The ability to synthesize emotional speech can also contribute to more engaging and relatable AI systems.
Recent advances in zero-shot voice conversion have exhibited potential in emotion control, yet performance remains suboptimal or inconsistent due to limited expressive capacity. We propose the Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. We significantly improve emotion conversion performance, doubling the baseline Emotion Conversion Accuracy (ECA) from 42.40% to 85.50% while maintaining linguistic integrity and speech quality, without compromising speaker identity. Our ablation study suggests that joint control of both sequence modulation and acoustic realization is essential to synthesize distinct emotions. Furthermore, comparative analysis verifies the generalizability of the proposed method and provides insights into the role of acoustic decoupling in maintaining speaker identity.
Primary: The University of Texas at Dallas
All Institutions: The University of Texas at Dallas
The paper presents the Emotion-Aware Prefix, which significantly enhances emotion control in voice conversion models. The methodology is innovative, and the results demonstrate substantial improvements in performance, although further details on implementation and datasets would strengthen the overall contribution to the field.
The proposed Emotion-Aware Prefix introduces a novel approach to emotion control in voice conversion by utilizing a two-stage voice conversion framework. The methodology emphasizes a joint control mechanism that integrates sequence modulation and acoustic realization, which is a significant advancement over existing methods that often struggle with expressive capacity. However, the paper could benefit from a more detailed explanation of the underlying algorithms and their implementation, as well as a clearer description of how the Emotion-Aware Prefix is integrated into the existing architecture.
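One generic way to realize such an emotion-conditioned prefix is to learn a small bank of per-emotion embeddings and prepend them to the content sequence, as sketched below; this is an assumption about the mechanism, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EmotionPrefix(nn.Module):
    def __init__(self, num_emotions=5, prefix_len=8, dim=256):
        super().__init__()
        # One learnable prefix of `prefix_len` vectors per target emotion.
        self.prefix = nn.Parameter(torch.randn(num_emotions, prefix_len, dim) * 0.02)

    def forward(self, content_seq, emotion_id):
        # content_seq: (batch, time, dim); emotion_id: (batch,) integer labels
        prefix = self.prefix[emotion_id]                  # (batch, prefix_len, dim)
        return torch.cat([prefix, content_seq], dim=1)    # emotion-conditioned sequence
```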
The experimental section demonstrates a robust evaluation of the proposed method, showing a substantial increase in Emotion Conversion Accuracy (ECA) from 42.40% to 85.50%. The use of ablation studies to analyze the contributions of different components of the model is commendable and adds credibility to the findings. However, the paper lacks detailed descriptions of the datasets used, including their size and diversity, which are critical for assessing the generalizability of the results.
The paper mentions the use of generative AI tools for grammar and word choice corrections, but it does not provide sufficient details regarding the implementation of the Emotion-Aware Prefix or the training process. Without clear instructions or access to code, reproducibility may be a concern for future researchers looking to build upon this work.
One limitation is the potential overfitting to the training data, which may affect the model's performance on unseen data. Additionally, while the results are impressive, the paper does not address the scalability of the approach or its performance in real-world applications, which could be critical for practical deployment.
This research has significant implications for applications in interactive voice response systems, virtual assistants, and entertainment technologies, where emotional expressiveness can enhance user experience. The ability to control emotions explicitly in voice conversion models could lead to more engaging and human-like interactions in various domains.
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
Primary: Tel Aviv University
All Institutions: Tel Aviv University
The main contribution of this paper is the introduction of ID-LoRA, a unified audio-video personalization method that effectively synthesizes a subject's appearance and vocal identity in a single generative pass. This work significantly advances the field of audio-visual generation by enabling coherent and contextually relevant outputs driven by a unified latent space, thus enhancing the controllability and fidelity of generated media.
The proposed ID-LoRA method innovatively combines audio and video generation in a unified framework, addressing the limitations of existing cascaded models that treat audio and video separately. The introduction of negative temporal positions and identity guidance is a significant advancement, allowing for better separation of reference and target tokens in the positional encoding space, which is crucial for maintaining the integrity of both modalities. The use of a shared latent space for audio and video generation is a novel approach that enhances the model's ability to generate coherent outputs based on a single text prompt, reference image, and audio clip.
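The identity-guidance idea parallels classifier-free guidance: contrast denoiser predictions with and without the speaker reference and push the output along the referenced direction, as sketched below; the call signature and scale are assumptions, not the paper's exact formulation.

```python
def identity_guided_prediction(denoiser, x_t, t, text_cond, ref_cond, scale=2.0):
    """CFG-style combination: amplify features carried by the reference signal."""
    pred_no_ref = denoiser(x_t, t, text=text_cond, reference=None)
    pred_ref = denoiser(x_t, t, text=text_cond, reference=ref_cond)
    # scale > 1 pushes the prediction further along the speaker-specific direction.
    return pred_no_ref + scale * (pred_ref - pred_no_ref)
```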
The experiments are rigorously designed, utilizing two diverse datasets (CelebV-HQ and TalkVid) to validate the model's performance across different scenarios. The paper presents comprehensive quantitative results, demonstrating significant improvements in speaker similarity and lip synchronization over state-of-the-art models, including a commercial model (Kling 2.6 Pro). The human preference studies further substantiate the model's effectiveness, showing a strong preference for ID-LoRA in voice similarity and environmental sound adherence.
The paper provides detailed implementation specifics, including training parameters, dataset descriptions, and evaluation metrics. However, the reproducibility could be enhanced by providing access to the training code and datasets, which are mentioned to be available but not explicitly linked in the paper.
While the model shows impressive results, it relies on a relatively small training dataset (~3K pairs), which may limit its generalizability. Additionally, the focus on a single GPU for training may restrict scalability and performance in more complex scenarios. The potential for misuse of the technology, such as non-consensual impersonation, is also a significant ethical concern that is acknowledged but requires further exploration.
The ID-LoRA framework has the potential for various applications, including multilingual dubbing, personalized content creation, and accessibility tools. However, the risks associated with generating realistic audio-visual content that can impersonate individuals necessitate careful consideration of ethical implications and the establishment of safeguards to prevent misuse.
Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal structure. We present the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that generates engine audio as parameterized pulse trains aligned to engine firing patterns and propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics. The architecture integrates physics-informed inductive biases including harmonic decay, thermodynamic pitch modulation, valve-dynamics envelopes, exhaust system resonances and derived engine operating modes such as throttle operation and deceleration fuel cutoff (DCFO). Validated on three diverse engine types totaling 7.5 hours of audio, PTR achieves a 21% improvement in harmonic reconstruction and a 5.7% reduction in total loss over a harmonic-plus-noise baseline model, while providing interpretable parameters corresponding to physical phenomena. Complete code, model weights, and audio examples are openly available.
Primary: Impulse Audio Lab GmbH
All Institutions: Impulse Audio Lab GmbH, Universitat Pompeu Fabra
The main contribution of this paper is the introduction of the Pulse-Train-Resonator (PTR) model for engine sound synthesis, which leverages physics-informed inductive biases to improve audio reconstruction quality and interpretability. This work represents a significant advancement in the field of neural audio synthesis, combining innovative methodology with rigorous experimental validation.
The methodology presented in this paper is innovative, introducing the Pulse-Train-Resonator (PTR) model, which directly models the pulse shapes and temporal structures of engine sounds rather than relying solely on spectral characteristics. The integration of physics-informed inductive biases into the neural architecture is a significant advancement, allowing for a more interpretable and physically grounded synthesis process. The use of differentiable signal processing techniques, particularly the adaptation of the Karplus-Strong algorithm for gradient-based optimization, demonstrates a sophisticated approach to audio synthesis that is both novel and effective.
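For readers unfamiliar with the resonator stage, a textbook Karplus-Strong recursion excited by a pulse (rather than noise) is sketched below; the actual PTR model is differentiable and adds further physics-informed terms.

```python
import numpy as np

def karplus_strong_pulse(pulse, delay_samples, n_samples, decay=0.996):
    """Classic Karplus-Strong loop: a delay line with a low-pass loop filter,
    here excited by a pulse to mimic a single exhaust event."""
    buf = np.zeros(delay_samples)
    buf[: min(len(pulse), delay_samples)] = pulse[:delay_samples]  # pulse excitation
    out = np.zeros(n_samples)
    idx = 0
    for n in range(n_samples):
        nxt = (idx + 1) % delay_samples
        out[n] = buf[idx]
        # Averaging adjacent samples acts as a low-pass loop filter; `decay`
        # controls how quickly the resonance dies away.
        buf[idx] = decay * 0.5 * (buf[idx] + buf[nxt])
        idx = nxt
    return out

# Example: a 100 Hz resonance at 16 kHz -> delay of 160 samples.
# y = karplus_strong_pulse(np.hanning(32), 160, 16000)
```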
The experimental evaluation is robust, utilizing a diverse dataset of engine sounds totaling 7.5 hours across three engine types. The reported improvements in harmonic reconstruction and total loss compared to a baseline model provide strong evidence of the effectiveness of the PTR model. The validation metrics are well-defined, and the perceptual analysis adds depth to the evaluation, showcasing the model's ability to capture complex acoustic behaviors.
The paper provides a clear link to the code, model weights, and audio examples, which enhances reproducibility. However, details regarding the training setup, including hyperparameters and specific configurations used during experiments, could be more thoroughly documented to facilitate easier replication by other researchers.
One limitation noted is the potential challenge of generalizing the model to real-world recordings, as the validation was conducted on synthesized data. Additionally, while the model performs well across different engine types, its robustness to environmental noise and variations in recording conditions remains to be thoroughly tested.
The potential applications of this research extend beyond engine sound synthesis to other areas of audio processing and sound design, where physics-informed models could enhance realism and control. The insights gained from this work could also inform the development of more sophisticated audio synthesis techniques in music technology and virtual environments.
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Visual-Language (VL) backbones with speech understanding and generation capabilities, while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is equipped with two lightweight, trainable plug-and-play modules, a speech projector and a speech token generator, while keeping the VL backbone fully frozen. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves excellent spoken QA performance, which is comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.
Primary: Huawei Leibniz Research Center
All Institutions: Huawei Leibniz Research Center, Hong Kong Polytechnic University, Harbin Institute of Technology, Shenzhen, Hong Kong University of Science and Technology, Shenzhen Loop Area Institute
The paper presents a novel framework for integrating speech capabilities into visual-language models, demonstrating significant advancements in efficiency and performance. The innovative methodology and experimental results position it as a meaningful contribution to the field of multimodal machine learning, addressing both practical and theoretical challenges.
The paper introduces Speech-Omni-Lite, a framework that effectively integrates speech capabilities into existing visual-language (VL) models without retraining the entire backbone. The methodology is innovative in its use of lightweight, trainable modules (speech projector and speech token generator) that maintain the performance of the frozen VL backbone. The QTATS data construction strategy is particularly noteworthy, as it creatively generates training data from existing ASR pairs, addressing the challenge of data scarcity in spoken QA tasks. This approach not only reduces the need for extensive spoken QA datasets but also demonstrates a novel way to leverage existing resources efficiently.
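The plug-and-play pattern, freezing the VL backbone and training only a small projector that maps speech features into the backbone's embedding space, can be sketched as follows; the dimensions and module names are illustrative, not the paper's released configuration.

```python
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps speech-encoder features into the frozen backbone's embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, speech_feats):             # (batch, frames, speech_dim)
        return self.proj(speech_feats)           # tokens consumable by the frozen backbone

def freeze_backbone(vl_backbone):
    for p in vl_backbone.parameters():
        p.requires_grad = False                  # only the projector (and token generator) train
```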
The experimental results presented in the paper are robust, showcasing competitive performance in spoken QA tasks even with limited training data. The authors provide a thorough evaluation across multiple datasets, comparing their method against large-scale omni-models. The results indicate that Speech-Omni-Lite achieves performance on par with models trained on millions of hours of data, highlighting the effectiveness of their approach. However, the paper could benefit from clearer presentation of quantitative results, as some tables and figures are referenced but not fully detailed in the text.
The paper provides a detailed description of the architecture and training procedures, which supports reproducibility. However, the lack of publicly available code or a project repository limits the ability for others to directly replicate the results. The authors mention using specific datasets and models, but without access to the exact configurations and training scripts, full reproducibility may be challenging.
One limitation of the proposed framework is its reliance on the quality of the generated QTATS data. Since this data is constructed from existing ASR pairs, any inherent biases or inaccuracies in the ASR data could propagate into the training of the speech token generator. Additionally, while the model demonstrates strong performance, it may still struggle with nuanced speech understanding in diverse real-world scenarios, particularly in noisy environments or with varied accents.
The potential applications of Speech-Omni-Lite are significant, particularly in enhancing accessibility for individuals with disabilities and improving human-machine interaction. By lowering the computational and data requirements for integrating speech into multimodal models, the framework could democratize access to advanced AI technologies. Furthermore, the emphasis on resource efficiency aligns with growing concerns about the environmental impact of large-scale AI training, making it a timely contribution to the field.
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors a foundational Audio Large Language Model for multidimensional reasoning and for the detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to substantially enhance the accuracy of descriptions and the temporal localization of quality issues. With this approach we reach a state-of-the-art mean PCC of 0.71 on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model's ability to pinpoint and classify audio artifacts in time.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a novel Calibration-Reasoning framework that significantly enhances the interpretability and accuracy of speech quality assessments by leveraging a two-stage post-training methodology. This work represents a meaningful advancement in the field, addressing critical gaps in existing approaches and setting a new standard for audio quality evaluation.
The proposed Calibration-Reasoning framework innovatively combines a two-stage post-training methodology that includes a calibration stage for aligning the model with perceptual dimensions and a reasoning stage utilizing Group Relative Policy Optimization (GRPO) with dimension-specific rewards. This approach effectively addresses the limitations of existing methods by enhancing both the accuracy of audio quality assessments and the interpretability of the model's outputs. The methodology is well-structured, with clear objectives and a logical flow from calibration to reasoning, making it a significant advancement in the field of speech quality assessment.
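The group-relative advantage at the core of GRPO, together with a placeholder dimension-specific reward, can be written in a few lines, as below; the reward definition is an assumed illustration, not the paper's.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (group_size,) scores for completions sampled from one prompt.
    Each completion's advantage is its reward normalized within the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dimension_reward(pred_dims, ref_dims, weights):
    """Hypothetical weighted agreement over perceptual dimensions
    (e.g. noisiness, coloration, discontinuity)."""
    return sum(w * float(pred_dims.get(k) == ref_dims.get(k)) for k, w in weights.items())
```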
The experiments are robust, utilizing the QualiSpeech benchmark, which is comprehensive and well-annotated. The reported results demonstrate a clear improvement over existing methods, achieving state-of-the-art Pearson Correlation Coefficient (PCC) scores. The use of ablation studies to evaluate the impact of various components of the methodology adds rigor to the experimental evaluation, providing insights into the effectiveness of the proposed approach.
The paper provides sufficient detail regarding the experimental setup, including model configurations and training methodologies. However, the reproducibility could be enhanced by including more specifics about the data preprocessing steps and hyperparameter settings. The availability of model weights and code on Hugging Face is a positive aspect that supports reproducibility.
The paper acknowledges limitations such as the computational overhead introduced by unfreezing the audio encoder during calibration and the potential challenges in reasoning about novel audio artifacts not present in the training data. These limitations highlight areas for future research and improvement.
The proposed framework has significant implications for the field of audio processing and speech quality assessment. By improving the interpretability and accuracy of audio quality evaluations, it could enhance applications in various domains, including telecommunications, media production, and assistive technologies for the hearing impaired. The framework's potential for extension to other acoustic domains further broadens its applicability.
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess not just what words are said, but how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, falling below random chance on some and achieving high accuracy on others. These results provide direction for targeted improvements in model capabilities.
Primary: Stanford University
All Institutions: Stanford University
The main contribution of this paper is the introduction of SCENEBench, a novel benchmark suite for evaluating audio understanding in real-world contexts. This work significantly advances the field by identifying critical gaps in current audio processing models and providing a structured approach to measure and improve their capabilities.
The methodology presented in this paper is robust, focusing on a comprehensive benchmark suite (SCENEBench) that evaluates various aspects of audio understanding beyond traditional ASR. The choice of tasks is well-justified, targeting real-world applications in accessibility and industrial monitoring. The synthetic construction of audio samples, while innovative, raises questions about the ecological validity of the results. The paper does a commendable job of validating the benchmark against natural audio items, which enhances the credibility of the proposed tasks.
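As a concrete picture of how such synthetic items can be built, the sketch below mixes a background clip under a speech clip at a target SNR. It is a minimal illustration of the overlay idea under stated assumptions, not the authors' exact construction pipeline.

```python
import numpy as np

def overlay_at_snr(speech, background, snr_db):
    """Mix a background clip under a speech clip at a target SNR in dB.
    Both inputs are float arrays at the same sample rate; the background
    is tiled or truncated to the speech length."""
    n = len(speech)
    bg = np.resize(background, n)
    p_speech = float(np.mean(speech ** 2)) + 1e-12
    p_bg = float(np.mean(bg ** 2)) + 1e-12
    gain = np.sqrt(p_speech / (p_bg * 10 ** (snr_db / 10.0)))
    mix = speech + gain * bg
    return mix / max(1.0, float(np.max(np.abs(mix))))  # simple peak guard
```

Sweeping `snr_db` yields task items of graded difficulty, which is one way a benchmark can probe background-sound understanding systematically.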
The experimental evaluation is thorough, assessing five state-of-the-art LALMs across multiple tasks. The results highlight significant performance gaps, with some models performing below random chance in certain categories. This is a critical finding that points to the need for further research and improvement in LALMs for audio understanding. However, the paper could benefit from a more detailed analysis of the factors contributing to the performance discrepancies observed across tasks.
The paper does not provide explicit details on the implementation of the benchmark or the models used, which could hinder reproducibility. Clearer guidelines on how to replicate the experiments and access the datasets would enhance the paper's impact.
One limitation is the reliance on synthetic audio samples, which may not fully capture the complexities of real-world audio environments. Additionally, the performance metrics could be expanded to include more nuanced evaluations of model behavior in different contexts.
The proposed benchmark has significant implications for advancing audio understanding technologies, particularly in assistive technologies and industrial applications. By addressing gaps in current models, this work could lead to more effective audio processing systems that better serve diverse user needs.
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimization, inherently limiting performance. We present an end-to-end multi-channel KWS framework that exploits spatial cues to improve noise robustness. A spatial encoder learns inter-channel features, while a spatial embedding injects directional priors; the fused representation is processed by a streaming backbone. Experiments in simulated noisy conditions across multiple signal-to-noise ratios (SNRs) show that spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving the best results. These findings validate end-to-end multi-channel spatial modeling, indicating strong potential for target-speaker-aware detection in complex acoustic scenarios.
Primary: Midea Group (Shanghai) Co
All Institutions: Midea Group (Shanghai) Co
The paper presents a novel end-to-end multi-channel KWS framework that effectively incorporates spatial modeling and directional priors to enhance noise robustness. The technical contributions are significant, with a well-defined methodology and promising experimental results, although practical limitations and reproducibility issues need to be addressed for broader application.
The paper introduces an innovative end-to-end multi-channel keyword spotting (KWS) framework that integrates a spatial encoder and a spatial embedding to leverage spatial cues for improved noise robustness. The architecture is well-structured, combining feature extraction, spatial modeling, and KWS into a unified framework, which is a significant departure from traditional cascaded systems. The methodology is sound, with clear descriptions of the components and their interactions. However, the assumption that the direction-of-arrival (DOA) is known during training and evaluation may limit the practical applicability of the approach in real-world scenarios.
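A minimal PyTorch sketch of the kind of fusion described above is given below: inter-channel features from a spatial encoder are combined with an embedding of a quantized direction-of-arrival before a streaming backbone. The layer choices, sizes, and the name SpatialFusion are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatialFusion(nn.Module):
    """Illustrative fusion of inter-channel features with a directional prior."""
    def __init__(self, n_channels=4, feat_dim=64, n_directions=36):
        super().__init__()
        self.spatial_encoder = nn.Conv1d(n_channels, feat_dim, kernel_size=3, padding=1)
        self.dir_embedding = nn.Embedding(n_directions, feat_dim)  # quantized DOA prior
        self.backbone = nn.GRU(feat_dim, feat_dim, batch_first=True)  # streaming-friendly

    def forward(self, multichannel_frames, doa_index):
        # multichannel_frames: (batch, channels, time); doa_index: (batch,)
        feats = self.spatial_encoder(multichannel_frames)    # (B, F, T)
        prior = self.dir_embedding(doa_index).unsqueeze(-1)  # (B, F, 1), broadcast over time
        fused = (feats + prior).transpose(1, 2)              # (B, T, F)
        out, _ = self.backbone(fused)
        return out

# Toy usage: batch of 2 utterances, 4-mic array, 100 frames, one DOA bin each.
x = torch.randn(2, 4, 100)
doa = torch.tensor([5, 17])
out = SpatialFusion()(x, doa)  # (2, 100, 64) features fed to a KWS head
```

The additive injection of the directional prior is only one plausible design; concatenation or FiLM-style modulation would serve the same illustrative purpose.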
The experimental setup is robust, utilizing a well-known dataset (Google Speech Commands v1) and simulating various noisy environments to evaluate the performance of the proposed system. The results demonstrate clear advantages over both single-channel and enhanced cascaded baselines, validating the effectiveness of the proposed approach. The paper provides comprehensive comparisons across different configurations, which strengthens the findings. However, the lack of real-world testing and reliance on simulated conditions may affect the generalizability of the results.
The paper provides a detailed description of the experimental setup, including data preparation, training, and testing methodologies. However, it lacks specific implementation details, such as hyperparameter settings and code availability, which are crucial for reproducibility. The absence of a demo or project URL further complicates the ability to replicate the results.
One significant limitation is the reliance on simulated data for both training and evaluation, which may not fully capture the complexities of real-world environments. Additionally, the assumption of known DOA during training could hinder the model's performance in practical applications where such information is not available. The performance gain from spatial priors appears to diminish under certain conditions, indicating potential weaknesses in the approach's robustness.
The proposed framework has the potential to enhance voice-controlled applications in noisy environments, making it relevant for various industries, including consumer electronics, automotive, and smart home devices. By improving keyword spotting accuracy in challenging acoustic scenarios, this work could lead to more reliable and user-friendly voice interfaces, ultimately benefiting end-users and developers alike.
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
Primary: Universitat Politècnica de Catalunya (UPC)
All Institutions: Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC)
The main contribution of this paper is the introduction of a zero-AV-resource AVSR framework that utilizes synthetic visual data to enhance speech recognition capabilities in under-resourced languages. This innovative approach not only addresses a critical gap in the field but also opens avenues for future research and development in multimodal speech recognition.
The proposed methodology leverages synthetic visual data generated from static images to create a training framework for AVSR in zero-resource scenarios. The use of lip-syncing techniques to generate talking-head videos is innovative, particularly in the context of under-resourced languages like Catalan. The end-to-end pipeline for generating synthetic audiovisual data is well-structured and language-agnostic, which enhances the applicability of the approach. The integration of a semi-automatic annotation pipeline further strengthens the methodology by providing a means to evaluate the model effectively. However, the reliance on synthetic data may raise questions about the generalizability of the results to real-world applications.
The experiments conducted are thorough, comparing the proposed model against both audio-only baselines and state-of-the-art ASR systems. The results demonstrate significant improvements in transcription accuracy when using synthetic visual data, particularly in challenging noise conditions. The authors provide clear metrics (WER) to quantify performance, and the comparative analysis with existing models like Whisper adds depth to the evaluation. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of various components of the model.
The paper includes a link to the GitHub repository containing the code and resources for synthetic data generation and annotation, which is a positive aspect for reproducibility. However, the details regarding the datasets and specific configurations used in the experiments could be more explicitly stated to facilitate replication by other researchers.
One limitation is the potential gap between synthetic and real-world data, as the synthetic videos may not fully capture the complexities of natural speech and visual cues. Additionally, while the model shows promise for Catalan, its performance on other under-resourced languages remains untested. The reliance on a single method for generating synthetic videos may also limit the robustness of the approach.
This research has the potential to significantly impact the field of speech recognition, particularly for under-resourced languages, by providing a scalable method for training AVSR systems without the need for extensive audiovisual datasets. The implications extend to various applications in accessibility, communication technologies, and language preservation.
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
Primary: IBM Research
All Institutions: IBM Research
The main contribution of this paper is the introduction of a non-autoregressive LLM-based ASR system that effectively combines the strengths of pretrained speech encoders and language models through a novel editing approach, significantly improving transcription speed and maintaining competitive accuracy. The methodology is innovative, and the experimental results demonstrate substantial technical impact, making it a valuable contribution to the field of machine learning and speech recognition.
The proposed methodology introduces a non-autoregressive (NAR) approach to automatic speech recognition (ASR) by framing it as conditional transcript editing. This is achieved through a bidirectional LLM editor that refines an initial hypothesis generated by a pretrained speech encoder. The interleaved padding strategy is a notable innovation, allowing the model to focus on corrections rather than full reconstructions, which enhances the efficiency of the editing process. The use of lightweight LoRA adapters for model adaptation is also a significant methodological contribution, enabling the model to leverage pretrained linguistic knowledge effectively while maintaining a manageable number of trainable parameters.
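The sketch below illustrates one plausible reading of the interleaved padding idea: pad slots are woven around the initial hypothesis so a bidirectional editor can keep, replace, or fill tokens in parallel, with most positions resolving to the identity mapping. The exact slot layout is an assumption, not the paper's specification.

```python
def interleave_padding(hypothesis_ids, pad_id=0, pads_per_token=2):
    """Weave pad slots around each hypothesis token so a parallel editor can
    correct locally instead of reconstructing the whole transcript."""
    out = [pad_id] * pads_per_token
    for tok in hypothesis_ids:
        out.append(tok)
        out.extend([pad_id] * pads_per_token)
    return out

# e.g. an initial CTC hypothesis [7, 12, 5] becomes
# [0, 0, 7, 0, 0, 12, 0, 0, 5, 0, 0], giving room for insertions and deletions.
print(interleave_padding([7, 12, 5]))
```

Because correct tokens can simply be copied through, the editor's work concentrates on the few positions that actually need changing, which is what enables the reported parallel speedups.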
The experiments conducted are rigorous, with the authors evaluating their model against leading ASR systems on the Open ASR leaderboard. The reported results demonstrate a competitive word error rate (WER) of 5.67% for NLE++, with a substantial speedup of 27x over autoregressive baselines in single-utterance scenarios. The inclusion of ablation studies further strengthens the evaluation, providing insights into the impact of various design choices on performance. However, the paper could benefit from more extensive comparisons with a broader range of models and additional datasets to validate the robustness of the findings.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository or demo URL limits the ability for others to directly replicate the results. The authors mention using specific datasets and configurations, which is helpful, but sharing the implementation would significantly improve reproducibility.
The paper acknowledges that the NLE approach is less flexible than autoregressive models in scenarios requiring substantial changes to the hypothesis. It also highlights potential latency overhead due to the need for retokenization when using different tokenizers for the CTC encoder and the LLM. Moreover, the performance in multilingual settings appears to be weaker, suggesting that the model's training data may not be adequately representative of all languages.
The proposed NLE system has significant implications for real-time ASR applications, particularly in conversational settings where low latency is critical. By enabling faster and more accurate transcription, this approach could enhance user experiences in various domains, including virtual assistants, customer service, and accessibility technologies. The ability to refine initial hypotheses rather than regenerate them from scratch could also lead to more efficient use of computational resources.
Automatic speech intelligibility assessment is crucial for monitoring speech disorders and therapy efficacy. However, existing methods are difficult to compare: research is fragmented across private datasets with inconsistent protocols. We introduce PathBench, a unified benchmark for pathological speech assessment using public datasets. We compare reference-free, reference-text, and reference-audio methods across three protocols (Matched Content, Extended, and Full) representing how a linguist (controlled stimuli) versus a machine learning specialist (maximum data) would approach the same data. We establish benchmark baselines across six datasets, enabling systematic evaluation of future methodological advances, and introduce Dual-ASR Articulatory Precision (DArtP), achieving the highest average correlation among reference-free methods.
Primary: University of Cologne
All Institutions: University of Cologne, Nagoya University, University of Groningen
The paper presents PathBench, a comprehensive benchmarking framework for assessing speech intelligibility in pathological speech, addressing critical gaps in the field. The innovative methodology and rigorous experimental evaluation contribute significantly to advancing the state of research in automatic speech assessment, with potential applications in clinical settings.
The paper introduces PathBench, a systematic benchmarking framework for assessing speech intelligibility in pathological speech, which is a significant advancement given the fragmented nature of existing research. The methodology is robust, employing a variety of protocols that cater to both linguistic and machine learning perspectives. The introduction of Dual-ASR Articulatory Precision (DArtP) as a reference-free method is particularly innovative, providing a new way to evaluate articulatory precision without the need for labeled training data. The authors also address confounding factors such as speaker age and recording noise, which enhances the credibility of their findings.
The experiments are comprehensive, utilizing six datasets and establishing baseline performances across multiple protocols. The results demonstrate that DArtP achieves the highest correlation among reference-free methods, which is a notable contribution. The statistical analyses, including Wilcoxon Signed-Rank Tests, are well-executed, providing strong evidence for the superiority of certain methodologies over others. The detailed reporting of results across various conditions adds to the rigor of the evaluation.
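The evaluation loop behind such comparisons can be pictured as below: correlate each automatic metric with listener intelligibility ratings, then run a paired test between competing metrics. All numbers are invented for illustration; the actual datasets and protocols are those described above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-utterance listener ratings and two automatic metrics.
ratings  = np.array([3.2, 4.1, 2.0, 4.8, 3.6, 2.9])
metric_a = np.array([0.61, 0.80, 0.35, 0.90, 0.70, 0.52])
metric_b = np.array([0.55, 0.72, 0.44, 0.81, 0.62, 0.58])

r_a, _ = stats.pearsonr(metric_a, ratings)    # correlation of metric A with ratings
r_b, _ = stats.pearsonr(metric_b, ratings)
stat, p = stats.wilcoxon(metric_a, metric_b)  # paired signed-rank test on per-item scores
print(f"PCC A={r_a:.2f}  PCC B={r_b:.2f}  Wilcoxon p={p:.3f}")
```

In a full benchmark this loop runs per dataset and per protocol, which is what yields the averaged correlations reported for DArtP and the baselines.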
The authors provide a GitHub repository with code and resources, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the datasets and specific implementation details to facilitate easier replication of the results by other researchers.
The study is limited to four languages (English, Italian, Spanish, and Dutch), which may restrict its applicability to a broader audience. Additionally, while the authors address confounding factors, the impact of noise in real-world scenarios remains untested, which is critical for clinical applications. The reliance on public datasets may also introduce variability that could affect the generalizability of the findings.
The implications of this research are significant for the fields of speech therapy and clinical assessment of speech disorders. By providing a standardized benchmarking framework, PathBench can facilitate future research and development of more effective speech intelligibility assessment tools. This could ultimately improve patient outcomes in clinical settings by enabling better monitoring and evaluation of speech disorders.
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Huawei Leibniz Research Center, Nanyang Technological University, The Hong Kong Polytechnic University
The paper effectively characterizes speaker identity leakage in full-duplex speech dialogue models and proposes innovative anonymization techniques that significantly enhance privacy without sacrificing usability. This work is a crucial step towards ensuring the responsible deployment of AI-driven speech technologies.
The paper introduces a novel approach to analyzing speaker identity leakage in end-to-end full-duplex speech dialogue models, specifically SALM-Duplex and Moshi. The authors employ a lazy-informed attacker scenario to assess privacy risks, which is a relevant and timely concern given the increasing use of always-on speech systems. The proposed anonymization techniques, Anon-W2W and Anon-W2F, are well-structured, with clear distinctions between waveform-level and feature-domain methods. The methodology is rigorous, utilizing established metrics like Equal Error Rate (EER) and Linkability to quantify privacy improvements.
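For readers unfamiliar with the privacy metric, a minimal implementation of EER from verification scores is sketched below. This is a standard computation, independent of the paper's specific attacker setup.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER: operating point where the false-acceptance rate (impostors accepted)
    equals the false-rejection rate (genuine trials rejected)."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine < t)    # genuine trials rejected at threshold t
        far = np.mean(impostor >= t)  # impostor trials accepted at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Higher EER after anonymization means the attacker is closer to chance (50%).
print(equal_error_rate([0.9, 0.85, 0.8, 0.7], [0.4, 0.5, 0.65, 0.3]))
```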
The experiments are comprehensive, employing a standardized dataset from the VoicePrivacy 2024 Challenge and a well-defined evaluation protocol. The results demonstrate significant improvements in privacy metrics, particularly with the Anon-W2F method, which achieves a notable increase in EER, indicating strong privacy protection. The authors also provide a thorough analysis of the impact of anonymization on dialogue quality and efficiency, showcasing a balanced consideration of privacy and usability.
The paper includes sufficient details regarding the experimental setup, including model architectures, training datasets, and evaluation metrics, which should facilitate reproducibility. However, the reliance on specific datasets and the proprietary nature of some components may pose challenges for full replication.
The study primarily focuses on two specific models (SALM-Duplex and Moshi), which may limit the generalizability of the findings to other full-duplex systems. Additionally, while the proposed anonymization methods show promise, the impact on speech quality and naturalness remains an area for further exploration. The authors also acknowledge that their quality metrics may not fully capture speech-level attributes.
The implications of this research are significant, particularly in the context of privacy regulations like GDPR. By addressing the privacy risks associated with always-on speech systems, the work contributes to the development of safer AI technologies that can be deployed in real-world applications without compromising user privacy. The findings could influence future designs of speech dialogue systems, emphasizing the need for privacy-by-design principles.
Although deep neural networks have facilitated significant progress of neural vocoders in recent years, they usually suffer from intrinsic challenges like opaque modeling, inflexible retraining under different input configurations, and parameter-performance trade-off. These inherent hurdles can heavily impede the development of this field. To resolve these problems, in this paper, we propose a novel neural vocoder in the time-frequency (T-F) domain. Specifically, we bridge the connection between the classical range-null decomposition (RND) theory and the vocoder task, where the reconstruction of the target spectrogram is formulated into the superimposition between range-space and null-space. The former aims to project the representation in the original mel-domain into the target linear-scale domain, and the latter can be instantiated via neural networks to further infill the spectral details. To fully leverage the spectrum prior, an elaborate dual-path framework is devised, where the spectrum is hierarchically encoded and decoded, and the cross- and narrow-band modules are leveraged for effectively modeling along sub-band and time dimensions. To enable inference under various configurations, we propose a simple yet effective strategy, which transforms multi-condition adaptation in the inference stage into data augmentation in the training stage. Comprehensive experiments are conducted on various benchmarks. Quantitative and qualitative results show that while enjoying a lightweight network structure and scalable inference paradigm, the proposed framework achieves state-of-the-art performance among existing advanced methods. Code is available at https://github.com/Andong-Li-speech/RNDVoC.
Primary: Institute of Acoustics, Chinese Academy of Sciences
All Institutions: Institute of Acoustics, Chinese Academy of Sciences, Chongqing University of Posts and Telecommunications, Tencent AI Lab, University of Chinese Academy of Sciences
This paper makes a significant contribution to the field of neural vocoding by introducing a novel architecture that effectively utilizes range-null space decomposition, enhancing both the interpretability and performance of audio synthesis models. The methodology is well-structured, and the experimental results substantiate its effectiveness, positioning it as a valuable advancement in the audio processing domain.
The paper introduces a novel neural vocoder architecture based on range-null space decomposition (RND), which effectively separates the reconstruction of audio spectrograms into two orthogonal components: range-space and null-space. This approach is innovative as it leverages classical signal processing theory to enhance the interpretability and robustness of neural vocoders. The dual-path framework proposed allows for hierarchical encoding and decoding of spectral features, which is a significant advancement over existing methods that typically use full-band modules. The introduction of a multi-condition-as-data-augmentation strategy is also noteworthy, as it allows for scalable inference without the need for retraining, addressing a common limitation in neural vocoders.
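A common way to write a range-null decomposition for mel-to-linear reconstruction is sketched below, where $A$ denotes the mel filterbank, $A^{+}$ its pseudo-inverse, $y$ the mel spectrogram, and $\hat{z}$ a network prediction. This is the generic RND form under the assumption that $A$ has full row rank; the paper's exact notation and instantiation may differ.

```latex
% Generic range-null decomposition (illustrative notation, not necessarily the paper's):
\hat{x} = \underbrace{A^{+} y}_{\text{range-space projection}}
        + \underbrace{(I - A^{+} A)\,\hat{z}}_{\text{null-space detail infilling}},
\qquad A\hat{x} = A A^{+} y = y \quad (\text{full row-rank } A).
```

Because $A(I - A^{+}A) = 0$, the mel-consistency constraint is satisfied regardless of the network output, so the learned component only has to infill spectral detail in the null space of $A$.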
The authors conducted comprehensive experiments on established benchmarks, including LJSpeech and LibriTTS, demonstrating state-of-the-art performance compared to existing methods. The quantitative metrics and qualitative assessments indicate that the proposed method not only achieves high-quality audio synthesis but also maintains a lightweight network structure, enhancing its practical applicability. The ablation studies further validate the effectiveness of the proposed components, providing a thorough evaluation of their contributions to performance.
The paper provides a GitHub repository link for code access, which is crucial for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and training configurations, could be better documented to facilitate easier replication of results by other researchers.
While the proposed method shows promise, it may still face challenges in handling extreme variations in input conditions that were not covered in the training data. Additionally, the reliance on the pseudo-inverse operation might introduce computational overhead in real-time applications, which could limit its deployment in resource-constrained environments.
The advancements in neural vocoding presented in this paper have significant implications for various audio processing applications, including text-to-speech synthesis, music generation, and speech enhancement. By improving the quality and efficiency of vocoders, this work could enhance user experiences in voice interfaces and multimedia applications, contributing to the broader field of artificial intelligence in audio processing.
The paper presents a significant advancement in neural vocoding by introducing a scalable framework that effectively integrates range-null space decomposition, addressing key challenges in the field. The innovative methodology and comprehensive experimental validation position this work as a valuable contribution to the audio processing community.
The proposed methodology introduces a novel neural vocoder framework based on range-null space decomposition (RND), which effectively addresses common challenges in existing vocoders, such as opaque modeling and inflexible retraining. The dual-path framework allows for hierarchical encoding and decoding of spectral features, leveraging both range-space and null-space modeling. The introduction of a multiple-condition-as-data-augmentation (MCDA) strategy enhances the model's adaptability to various mel configurations without the need for retraining, showcasing an innovative approach to scalability in neural vocoders.
The experiments are comprehensive, utilizing well-known benchmarks like LJSpeech and LibriTTS. The results demonstrate that the proposed method achieves state-of-the-art performance, outperforming existing models such as BigVGAN with significantly fewer parameters. The quantitative metrics, including PESQ and MCD, alongside qualitative assessments, indicate a robust evaluation of the model's effectiveness.
The paper provides a GitHub repository for code access, which is crucial for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and training procedures, should be clearly documented to facilitate replication by other researchers.
While the proposed framework shows promise, it may still struggle with certain edge cases in phase recovery and may require further optimization for real-time applications. Additionally, the reliance on specific datasets may limit the generalizability of the findings.
The advancements in neural vocoding have significant implications for various applications in speech synthesis, music generation, and audio processing. The ability to efficiently adapt to different configurations can enhance the deployment of these models in real-world scenarios, potentially leading to broader adoption in commercial products.
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8-3.0x latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
SoundWeaver introduces a novel approach to accelerating text-to-audio diffusion models through semantic warm-starting, demonstrating substantial improvements in latency and quality. The comprehensive methodology and experimental validation position this work as a meaningful contribution to the field of machine learning in audio generation.
The methodology presented in SoundWeaver is innovative, focusing on warm-starting text-to-audio diffusion models by leveraging semantically similar cached audio. The system comprises three main components: a Reference Selector for retrieving and aligning cached audio, a Skip Gater for determining the number of NFEs to skip, and a Cache Manager for maintaining cache quality. The use of a contextual multi-armed bandit approach for the Skip Gater is particularly noteworthy, as it adapts to varying user prompts and optimizes performance dynamically. The integration of semantic and duration-aware retrieval mechanisms adds depth to the approach, allowing for more efficient audio generation while preserving quality.
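The retrieval step can be pictured as below: a cached entry is selected only if its prompt embedding is sufficiently similar and its duration roughly matches, otherwise the system falls back to a cold start with the full NFE schedule. The thresholds, field names, and cache structure are assumptions for illustration.

```python
import numpy as np

def select_reference(prompt_emb, prompt_dur, cache, sim_thresh=0.8, dur_tol=0.25):
    """Pick the most semantically similar cached entry, gated by similarity
    and duration agreement. Returns (None, -1.0) on a cache miss (cold start)."""
    best, best_sim = None, -1.0
    for entry in cache:  # each entry: {"emb": ..., "dur": ..., "latent": ...}
        denom = np.linalg.norm(prompt_emb) * np.linalg.norm(entry["emb"]) + 1e-12
        sim = float(np.dot(prompt_emb, entry["emb"]) / denom)
        duration_ok = abs(entry["dur"] - prompt_dur) <= dur_tol * prompt_dur
        if sim >= sim_thresh and duration_ok and sim > best_sim:
            best, best_sim = entry, sim
    return best, best_sim
```

On a hit, the cached latent seeds the sampler partway through its schedule, and a gating policy decides how many NFEs can safely be skipped given the similarity score.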
The experimental evaluation is robust, utilizing real-world audio traces and a variety of metrics to assess performance. The results demonstrate significant latency reductions (1.8-3.0x) while maintaining or improving perceptual quality across different models. The ablation studies effectively illustrate the contributions of each component, reinforcing the importance of the proposed methods. However, the reliance on specific datasets and the absence of extensive user studies could limit the generalizability of the findings.
The paper provides a detailed description of the experimental setup, including the models used, metrics evaluated, and the caching mechanism. However, the lack of a publicly accessible code repository or demo limits reproducibility. The authors mention using generative AI for writing and evaluation, which raises questions about the transparency of the evaluation process.
The paper acknowledges limitations such as potential phase vocoder distortion on longer audio requests and the lack of dedicated request schedulers. Additionally, the system's performance with complex samplers remains untested, which could impact its applicability in diverse scenarios.
SoundWeaver has significant implications for real-time audio generation applications, such as music composition and sound design. By reducing latency and improving throughput, it can enhance user experience in various audio-related services. The model-agnostic nature of the approach also suggests potential for broader adoption across different diffusion models and applications.
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University
The main contribution of this paper is the introduction of USCF, a novel method for extracting speaker-agnostic speech representations that preserve phonetic content while suppressing speaker timbre. This work significantly advances the field of voice conversion by enabling zero-shot adaptation to unseen speakers, thus broadening the applicability of voice conversion technologies.
The methodology presented in the paper introduces Universal Speech Content Factorization (USCF), which extends the existing Speech Content Factorization (SCF) to an open-set setting. The approach is grounded in linear transformations and least-squares optimization, allowing for the extraction of speaker-agnostic content representations from speech data. The authors provide a clear derivation of the universal speech-to-content mapping and speaker transformation matrices, demonstrating a solid understanding of the underlying mathematical principles. The use of embedding analysis to validate the effectiveness of USCF in removing speaker-dependent variations while preserving phonetic content is a strong methodological aspect. However, the reliance on linear assumptions may limit the generalizability of the findings.
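The linear-algebraic core of such a method can be sketched as two ridge-regularized least-squares fits: a universal map from speech features to content, and a speaker-specific map from content back to a target speaker's features estimated from a few seconds of audio. The matrix shapes, regularizer, and chaining shown here are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

def ridge_lsq(inputs, targets, reg=1e-3):
    """Solve targets ≈ W @ inputs in the least-squares sense (ridge-regularized).
    inputs: (d_in, N) frames stacked column-wise; targets: (d_out, N)."""
    d = inputs.shape[0]
    return targets @ inputs.T @ np.linalg.inv(inputs @ inputs.T + reg * np.eye(d))

# Universal feature-to-content map, fit once over many speakers:
#   W_content = ridge_lsq(all_features, all_content)
# Speaker-specific transform, fit from a few seconds of the target speaker:
#   W_speaker = ridge_lsq(W_content @ target_features, target_features)
# Conversion then chains the two linear maps, keeping the whole pipeline
# invertible and training-free beyond the least-squares solves.
```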
The experimental evaluation is robust, utilizing multiple datasets (LibriSpeech and TIMIT) to assess the performance of USCF in voice conversion tasks. The authors compare USCF against several baseline methods, including kNN-VC and LinearVC, providing both objective and subjective metrics to evaluate intelligibility, naturalness, and speaker similarity. The results indicate that USCF performs competitively, particularly in content preservation, although some degradation in speaker similarity is noted. The inclusion of ablation studies further strengthens the evaluation by analyzing the impact of different parameters on performance.
The paper includes sufficient details regarding the experimental setup, including the selection of datasets, metrics used, and the specific configurations for the voice conversion tasks. The authors have made their code publicly available, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the hyperparameter tuning process and the specific conditions under which experiments were conducted.
One limitation of the proposed method is the reliance on linear transformations, which may not capture complex relationships in speech data. Additionally, the performance degradation in speaker similarity indicates that while content preservation is achieved, the quality of voice conversion may suffer when adapting to unseen speakers. The requirement for a minimum amount of target speaker data (10 seconds) for effective transformation may also limit the applicability of USCF in scenarios with very limited data.
The implications of this research are significant for applications in voice conversion and text-to-speech synthesis, particularly in scenarios where speaker adaptation is necessary without extensive training data. The ability to generate speaker-agnostic representations could enhance accessibility in voice technologies and improve user experiences in various applications, including virtual assistants and personalized speech synthesis.
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
Primary: University of Sheffield
All Institutions: University of Sheffield, University of Southern California
The main contribution of this paper is the introduction of VoxEmo, a comprehensive benchmarking framework for speech emotion recognition that addresses the challenges of prompt sensitivity and human emotion ambiguity in the evaluation of speech LLMs. The technical contributions, including the standardized toolkit and innovative evaluation strategies, position this work as a significant advancement in the field of SER.
The paper introduces VoxEmo, a novel benchmarking framework for speech emotion recognition (SER) using speech LLMs. The methodology is well-structured, addressing the challenges of prompt sensitivity and human emotion ambiguity through a comprehensive toolkit that includes a distribution-aware soft-label protocol and a prompt-ensemble strategy. The approach of utilizing multiple prompts to capture the stochastic nature of LLM outputs is innovative, although it may lead to increased complexity in evaluation.
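One way to realize the distribution-aware protocol is sketched below: predictions from an ensemble of prompts are turned into a label distribution and compared against the human annotator distribution with a symmetric divergence. The label set, divergence choice, and numbers are illustrative assumptions, not the benchmark's actual definitions.

```python
import numpy as np
from collections import Counter

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def prompt_ensemble_distribution(predictions):
    """Fraction of prompt variants voting for each label (soft prediction)."""
    counts = Counter(predictions)
    return np.array([counts.get(e, 0) for e in EMOTIONS], float) / max(1, len(predictions))

def jensen_shannon(p, q, eps=1e-12):
    """Symmetric divergence between two categorical distributions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

model_dist = prompt_ensemble_distribution(["happy", "neutral", "happy", "happy"])
human_dist = np.array([0.0, 0.6, 0.3, 0.1])  # hypothetical annotator vote shares
print(jensen_shannon(model_dist, human_dist))
```

Scoring distributions rather than single labels is what lets the benchmark credit models that mirror human disagreement even when their argmax label is "wrong".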
The experiments are extensive, covering 35 emotion corpora across 15 languages. The results demonstrate the performance of two speech LLMs (Qwen2-Audio and Audio Flamingo) under various prompt configurations. The analysis of zero-shot performance and the impact of supervised fine-tuning is thorough, providing valuable insights into the strengths and weaknesses of the models. However, the paper could benefit from more detailed comparisons with existing state-of-the-art methods.
The paper emphasizes reproducibility by providing a standardized evaluation toolkit and clear descriptions of the experimental setup, including the selection of models and evaluation metrics. However, the reliance on specific models and the absence of a public code repository may hinder full reproducibility.
The paper acknowledges several limitations, including the focus on only two models with the same audio encoder, the potential for hyperparameter mismatch during fine-tuning, and the restriction of soft-label evaluation to a limited number of datasets. Additionally, the study does not explore within-dataset factors that could affect performance.
The proposed benchmark has significant implications for the development of affect-aware systems in human-computer interaction and speech analytics. By addressing the ambiguity of human emotion and providing a framework for evaluating generative models, this work could lead to advancements in more nuanced and effective emotion recognition systems.
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
WhispEar presents a novel bidirectional framework for whispered speech conversion, effectively addressing data scarcity through innovative pseudo-parallel data generation. The paper's contributions significantly advance the field of speech processing, particularly in enhancing the intelligibility and naturalness of whispered speech.
The methodology presented in WhispEar is innovative, leveraging a bidirectional framework that allows for both whisper-to-normal (W2N) and normal-to-whisper (N2W) conversions. The use of semantic representations to bridge the gap between the two modalities is a significant advancement. The three-stage training process, particularly the zero-shot pseudo-parallel whisper generation, is a clever approach to mitigate the scarcity of parallel data. The incorporation of a lightweight semantic tokenizer and a shared Flow-Matching Transformer model demonstrates a solid understanding of the underlying acoustic characteristics and the need for efficient data utilization.
The experiments are well-structured, comparing WhispEar against strong baselines and demonstrating clear performance improvements across various metrics, including intelligibility, naturalness, and prosody recovery. The release of the wEar dataset, the largest bilingual whispered-normal parallel corpus, adds significant value to the research community. The systematic scaling study provides compelling evidence of the effectiveness of the proposed methods, showcasing how increasing the amount of pseudo-parallel data leads to consistent performance gains.
The paper provides sufficient details regarding the training process, data collection, and evaluation metrics, which should enable other researchers to replicate the experiments. However, the absence of a publicly available code repository limits full reproducibility, as potential users cannot directly implement the proposed methods without access to the code.
One limitation noted is the reliance on the quality of the generated pseudo-whispered data, which may not fully capture the nuances of real whispered speech. Additionally, while the framework shows promise, its performance in noisy environments or with diverse speaker characteristics has not been thoroughly evaluated. Future work should address these aspects to enhance robustness and generalizability.
The implications of this research are significant, particularly in areas requiring whispered speech conversion for privacy and communication enhancement. The ability to generate high-quality whispered speech from normal speech could have applications in assistive technologies, voice restoration, and privacy-focused communication tools. The release of the wEar dataset also paves the way for further research in this domain, potentially leading to advancements in speech synthesis and recognition technologies.
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
Primary: Anyreach AI
All Institutions: Anyreach AI
The main contribution of this paper is the introduction of DualTurn, a model that effectively learns turn-taking dynamics in conversational audio through generative pretraining, outperforming existing methods in both anticipation of turn boundaries and prediction of agent actions. This work represents a meaningful advancement in the field of conversational AI, addressing limitations in current models and providing a foundation for future research in multi-speaker interaction systems.
The methodology presented in DualTurn is innovative, leveraging dual-channel generative pretraining to learn turn-taking dynamics without labeled data. The use of a lightweight neural codec for audio encoding, combined with a two-stage training process, allows the model to effectively capture conversational context and predict turn-taking signals. The architecture is well thought out, with a clear distinction between generative pretraining and subsequent fine-tuning for specific tasks, which enhances the model's performance in predicting agent actions.
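A simplified view of the dual-channel pretraining input is given below: per-frame codec tokens from the two speakers are interleaved into a single autoregressive sequence so the model must predict both channels' futures and thereby absorb turn-taking dynamics. The exact token layout is an assumption for illustration.

```python
def interleave_channels(user_tokens, agent_tokens):
    """Interleave per-frame codec tokens from both speakers into one sequence
    for autoregressive next-token pretraining (illustrative layout only)."""
    assert len(user_tokens) == len(agent_tokens)
    seq = []
    for u, a in zip(user_tokens, agent_tokens):
        seq.append(("user", u))
        seq.append(("agent", a))
    return seq

# e.g. frames [3, 9] (user) and [5, 1] (agent) ->
# [('user', 3), ('agent', 5), ('user', 9), ('agent', 1)]
print(interleave_channels([3, 9], [5, 1]))
```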
The experimental evaluation is robust, utilizing standard benchmarks such as Switchboard and otoSpeech to compare DualTurn against existing models like VAP and a large audio-text fusion model. The results demonstrate significant improvements in both word-level turn prediction and agent action prediction, with clear metrics provided (e.g., wF1 and AUC scores). The ablation studies further validate the contributions of different components of the model, showcasing the effectiveness of the generative pretraining stage.
The paper provides sufficient details about the architecture, training procedures, and datasets used, which supports reproducibility. However, the absence of URLs for code or demo implementations limits the ability for others to directly replicate the results. Including a public repository would enhance reproducibility significantly.
One limitation noted is the reliance on a single language (English) and a relatively small dataset (453 hours of dual-channel conversation audio), which may affect the generalizability of the model to other languages or larger, more diverse datasets. Additionally, while the model anticipates turn boundaries earlier, the practical implications of this in real-world applications need further exploration.
The implications of DualTurn are significant for applications in conversational AI, particularly in enhancing the naturalness of interactions in voice assistants and other automated systems. By improving turn-taking dynamics, the model can contribute to more fluid and human-like conversations, which is critical for user satisfaction and engagement in AI-driven communication tools.
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it with a two-step local-global scheme driven by an evolution strategy. ESC preserves performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with existing PTQ methods further reduces performance loss, limiting relative accuracy degradation to 1% on the AST model.
Primary: cortAIx Labs
All Institutions: cortAIx Labs
The paper presents a novel calibration method for low-bit quantization of speech models that leverages evolution strategies to optimize activation scaling, demonstrating significant performance improvements across various tasks. The technical contributions are substantial, addressing a critical gap in the quantization of audio models and paving the way for more efficient deployment in resource-constrained environments.
The proposed Evolution Strategy-based Calibration (ESC) method is innovative, particularly in its formulation of calibration as a two-step optimization problem that integrates local and global objectives. The use of evolution strategies to optimize activation scaling factors is a novel approach tailored specifically for the audio domain, addressing the unique challenges posed by audio activations that differ significantly from those in vision and NLP. The methodology is well-structured, with clear steps for initialization and optimization, although it could benefit from more detailed explanations of the algorithm's parameters and their tuning.
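The core idea of searching for activation scales with an evolution strategy can be illustrated with a toy sketch (below); the selection scheme, population size, and annealing schedule are illustrative assumptions and do not reproduce the paper's exact two-step local-global procedure.

```python
import numpy as np

def quantize_dequantize(x, scale, bits=4):
    """Symmetric uniform quantization of activations with a given clipping scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale * qmax), -qmax - 1, qmax)
    return q / qmax * scale

def es_calibrate_scale(activations, bits=4, pop_size=16, sigma=0.1, iters=50):
    """Toy evolution-strategy search for an activation scale minimizing
    quantization MSE on a calibration batch (simplified illustration only)."""
    scale = np.abs(activations).max()          # start from the naive max-abs scale
    best_err = np.mean((activations - quantize_dequantize(activations, scale, bits)) ** 2)
    for _ in range(iters):
        candidates = np.abs(scale * (1.0 + sigma * np.random.randn(pop_size))) + 1e-8
        errs = [np.mean((activations - quantize_dequantize(activations, s, bits)) ** 2)
                for s in candidates]
        i = int(np.argmin(errs))
        if errs[i] < best_err:                 # greedy (1+lambda)-style selection
            scale, best_err = candidates[i], errs[i]
        sigma *= 0.97                          # slowly anneal the search radius
    return scale
```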
The experiments conducted are comprehensive, covering multiple speech tasks and models, which strengthens the validity of the results. The paper reports significant improvements over existing calibration methods, particularly in INT4 quantization, which is crucial for deploying models in resource-constrained environments. However, the paper lacks detailed descriptions of datasets and specific evaluation metrics used, which could enhance the reproducibility and understanding of the results.
While the paper outlines the methodology and experimental setup, it does not provide sufficient implementation details or code availability, which are critical for reproducibility. The absence of a project URL or demo further limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a specific hardware configuration (NVIDIA RTX 3090) for performance evaluation, which may not generalize across different platforms. Additionally, while the method shows promise for INT4 quantization, the paper does not explore the trade-offs or potential degradation in performance for other model architectures or tasks outside those tested.
The proposed ESC method has the potential to significantly impact the deployment of speech models in real-world applications, particularly in scenarios where computational resources are limited. By enabling near-lossless performance at lower bit-widths, this work could facilitate the broader adoption of advanced speech processing technologies in mobile and embedded systems.
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, University of California
The main contribution of this paper is the introduction of Trilobyte, a byte-level tokenization schema that enables tractable modeling of 24-bit audio for lossless compression using autoregressive language models. This work significantly advances the application of machine learning in audio compression, addressing a critical gap in the literature and providing a foundation for future research in the area.
The paper introduces a novel byte-level tokenization schema, Trilobyte, which effectively addresses the vocabulary explosion problem in autoregressive language models (LMs) for lossless audio compression. By reducing the vocabulary size from exponential scaling to a constant size, the authors enable tractable modeling of 24-bit audio, a significant advancement over prior work limited to 8-bit audio. The methodology is well-structured, detailing the compression pipeline, the use of arithmetic coding, and the training of models on diverse audio datasets. The approach is theoretically sound and leverages established principles of autoregressive modeling, making it a meaningful contribution to the field.
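The vocabulary-scaling argument is easy to see in code: splitting each PCM sample into its constituent bytes keeps the token vocabulary at 256 symbols regardless of bit depth. The sketch below shows one plausible little-endian byte split; the paper's actual byte ordering and framing may differ.

```python
import numpy as np

def bytes_tokenize(samples: np.ndarray, bit_depth: int) -> np.ndarray:
    """Illustrative byte-level tokenization: split each signed PCM sample into
    bit_depth // 8 little-endian bytes, giving a fixed 256-symbol vocabulary
    instead of the 2**bit_depth symbols of sample-level tokenization."""
    n_bytes = bit_depth // 8
    unsigned = samples.astype(np.int64) + 2 ** (bit_depth - 1)   # shift to unsigned range
    tokens = [(unsigned >> (8 * b)) & 0xFF for b in range(n_bytes)]
    return np.stack(tokens, axis=-1).reshape(-1).astype(np.uint8)

# Example: 24-bit audio -> 3 byte tokens per sample, vocabulary size 256.
pcm = np.array([-8_388_608, 0, 8_388_607], dtype=np.int64)
print(bytes_tokenize(pcm, bit_depth=24))  # 9 byte-valued tokens
```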
The authors conduct a comprehensive benchmarking of their proposed method across various audio domains (music, speech, bioacoustics) and bit depths (8, 16, 24-bit). The experiments are rigorous, with comparisons to industry-standard codecs like FLAC, and they provide detailed results that highlight the performance of Trilobyte in different scenarios. The evaluation demonstrates that while the compression gains are modest at higher bit depths, the method consistently outperforms FLAC at 8-bit and shows competitive results at 16-bit.
The authors provide a GitHub repository for the Trilobyte implementation, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and training conditions, to facilitate replication of results by other researchers.
The paper acknowledges that the computational cost of the proposed ML approaches is significantly higher than traditional codecs like FLAC, which may limit their practical deployment in real-world scenarios. Additionally, the modest compression gains at higher bit depths suggest that further optimization is needed to make these methods more competitive.
The work has significant implications for the field of audio compression, particularly in contexts where lossless audio fidelity is critical, such as professional audio production and archival storage. By demonstrating the potential of LMs for lossless audio compression, this research opens avenues for future exploration of machine learning techniques in audio processing.
Recent studies have shown that post-deployment adaptation can improve the robustness of speech enhancement models in unseen noise conditions. However, existing methods often incur prohibitive computational and memory costs, limiting their suitability for on-device deployment. In this work, we investigate model adaptation in realistic settings with dynamic acoustic scene changes and propose a lightweight framework that augments a frozen backbone with low-rank adapters updated via self-supervised training. Experiments on sequential scene evaluations spanning 111 environments across 37 noise types and three signal-to-noise ratio ranges, including the challenging [-8, 0] dB range, show that our method updates fewer than 1% of the base model's parameters while achieving an average 1.51 dB SI-SDR improvement within only 20 updates per scene. Compared to state-of-the-art approaches, our framework achieves competitive or superior perceptual quality with smoother and more stable convergence, demonstrating its practicality for lightweight on-device adaptation of speech enhancement models under real-world acoustic conditions.
Primary: Institute of Neuroinformatics
All Institutions: Institute of Neuroinformatics, University of Zurich, ETH Zurich
The main contribution of this paper is the introduction of a lightweight self-supervised adaptation framework for speech enhancement models that efficiently updates model parameters in real-world acoustic environments. This work represents a significant step toward making advanced speech processing technologies more accessible and practical for on-device applications.
The paper presents a novel self-supervised adaptation framework leveraging low-rank adapters for speech enhancement models. This approach addresses the critical issue of adapting models to dynamic acoustic environments without the need for extensive parameter updates, which is a significant advancement over traditional methods that require fine-tuning a large number of parameters. The methodology is well-structured, clearly outlining the adaptation process and the rationale behind using low-rank adapters. However, the paper could benefit from a more detailed explanation of the self-supervised training process and how pseudo-targets are generated.
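The "frozen backbone plus trainable low-rank adapters" pattern can be sketched as follows; the rank, scaling, and adaptation objective here are generic LoRA-style assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + B @ A.
    Only A and B are updated during on-device adaptation, keeping the number of
    trainable parameters a small fraction of the backbone."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Adaptation loop sketch: for each new acoustic scene, run a few gradient steps on a
# self-supervised objective (e.g., pseudo-targets from the frozen model), updating
# only the LoRA parameters.
```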
The experimental setup is robust, involving evaluations across 111 environments and multiple noise types, which strengthens the validity of the results. The metrics used (PESQ, STOI, and SI-SDR) are appropriate for assessing speech enhancement quality. The results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art approaches while maintaining a significantly lower computational footprint. However, the paper lacks a detailed comparison of the proposed method with other lightweight adaptation techniques beyond RemixIT, which could provide a more comprehensive view of its relative performance.
The paper provides a thorough description of the experimental setup, including model architectures, training procedures, and dataset details, which enhances reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the results directly. Including a link to a repository or providing access to the datasets used would significantly improve reproducibility.
One limitation of the proposed method is its reliance on the quality of the pseudo-targets generated during adaptation. If the initial model is not sufficiently robust, the adaptation may not yield optimal results. Additionally, while the method shows promise for dynamic environments, its performance in highly variable or extreme conditions remains to be tested. The paper also does not address the potential computational overhead associated with the self-supervised training phase.
The proposed lightweight adaptation framework has significant implications for real-world applications, particularly in mobile and edge computing environments where computational resources are limited. By enabling effective on-device adaptation of speech enhancement models, this work could improve accessibility for users of hearing aids and other assistive listening devices in diverse acoustic settings. The approach could also be extended to other domains requiring real-time audio processing, enhancing the practicality of machine learning solutions in everyday applications.
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation and leading to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS recordings and enable automated signal analysis, offering the potential to provide clinicians with objective, quantitative feedback on bowel activity. This study presents an automated pipeline for bowel sound segmentation and classification using a wearable acoustic SonicGuard sensor. BS signals from 83 subjects were recorded with the sensor. Data from 40 subjects were manually annotated by clinical experts and used to train an automatic annotation algorithm, while the remaining subjects were used for further model evaluation. An energy-based event detection algorithm was developed to detect BS events, and the detected segments were then classified into BS patterns using a pretrained Audio Spectrogram Transformer (AST) model. Model performance was evaluated separately for healthy individuals and patients. The best configuration used two specialized models, one trained on healthy subjects and one on patients, achieving an accuracy of 0.97 and AUROC of 0.98 for the healthy group and an accuracy of 0.96 and AUROC of 0.98 for the patient group. The auto-annotation method reduced manual labeling time by approximately 70%, and expert review showed that fewer than 12% of automatically detected segments required correction. The proposed segmentation and classification system enables quantitative assessment of bowel activity, providing clinicians with an objective diagnostic tool that may improve the diagnosis of gastrointestinal function and support the annotation of large-scale datasets.
Primary: Carl von Ossietzky Universität Oldenburg
All Institutions: Carl von Ossietzky Universität Oldenburg, PIUS Hospital
The main contribution of this paper is the development of an automated pipeline for bowel sound segmentation and classification that integrates advanced machine learning techniques with a wearable acoustic sensor, addressing the challenges of subjective auscultation in clinical practice. The comprehensive methodology and promising results indicate a significant step forward in the objective assessment of gastrointestinal function, with potential implications for clinical diagnostics and research.
The paper presents a comprehensive automated pipeline for bowel sound segmentation and classification, utilizing a wearable acoustic sensor. The methodology is well-structured, combining an energy-based event detection algorithm with advanced deep learning models (Audio Spectrogram Transformer and Wav2Vec). The approach is innovative in its integration of cohort-specific models to account for differences between healthy individuals and patients, which is a significant advancement over previous works that did not consider such variability. The detailed description of the event detection algorithm, including the use of RMS amplitude and energy variations, demonstrates a thoughtful approach to addressing the challenges posed by the heterogeneous nature of bowel sounds.
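A generic RMS-energy event detector of the kind described can be sketched as below; the frame length, threshold rule (median plus a MAD multiple), and minimum duration are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def detect_events(signal, sr, frame_ms=25, hop_ms=10, k=3.0, min_dur_ms=20):
    """Toy energy-based event detector: frame-wise RMS is compared against an
    adaptive threshold, and consecutive frames above threshold are merged into
    candidate events (onset_s, offset_s)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    rms = np.array([np.sqrt(np.mean(signal[i:i + frame] ** 2))
                    for i in range(0, len(signal) - frame, hop)])
    thr = np.median(rms) + k * np.median(np.abs(rms - np.median(rms)))
    active = rms > thr
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if (i - start) * hop_ms >= min_dur_ms:
                events.append((start * hop / sr, i * hop / sr))
            start = None
    return events
```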
The experiments are robust, involving recordings from a diverse set of subjects (both healthy and patients) and a well-defined evaluation protocol. The performance metrics (accuracy and AUROC) indicate strong model performance, particularly with the AST model achieving high accuracy rates (0.97 for healthy subjects and 0.96 for patients). The use of expert-reviewed annotations adds credibility to the evaluation process. However, the paper could benefit from additional comparative analyses with other state-of-the-art methods to further validate the proposed approach.
The authors provide a GitHub repository for the implementation of their approach, which is a positive aspect for reproducibility. However, the paper lacks detailed information on the specific experimental setup, such as hyperparameter tuning and training procedures, which could hinder full reproducibility by other researchers.
The study acknowledges limitations, such as the tendency of the auto-annotation framework to truncate certain event durations, particularly for the MB class. Additionally, the reliance on a relatively small dataset for training and evaluation may affect the generalizability of the model. The authors could also explore the impact of noise and other external factors on the model's performance in real-world clinical settings.
The proposed automated system has significant potential applications in clinical settings, providing objective and quantitative assessments of bowel sounds that could enhance diagnostic accuracy and efficiency. By reducing the workload on clinicians and enabling the analysis of large datasets, this work could facilitate improved patient monitoring and treatment strategies in gastrointestinal care. The development of such tools aligns with the growing trend towards digital health and personalized medicine.
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches focus primarily on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle such context-rich recognition, termed CAVSR (AVSR including rich visual Context), we propose VASR, a model designed to "see" and reason over the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. In addition, to address data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates single-modality dominance, achieving state-of-the-art performance on CAVSR. The project is open-sourced.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The paper presents a novel approach to context-aware audio-visual speech recognition by leveraging rich visual context through a structured reasoning framework. This work significantly advances the field by addressing the limitations of existing AVSR methods and providing a comprehensive dataset for future research.
The proposed methodology introduces the Audio-Visual Chain-of-Thought (AV-CoT) framework, which is a structured approach to integrate visual context into speech recognition tasks. This is a significant advancement over traditional AVSR methods that primarily focus on lip movements. The three-step process of Perception, Reasoning, and Transcription is well-defined, allowing for a systematic approach to disambiguate speech using multimodal inputs. The authors also address the challenge of data scarcity by developing a scalable data pipeline, which is a commendable effort in enhancing the dataset quality for CAVSR tasks.
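The Perception-Reasoning-Transcription structure can be illustrated with a simple prompt-construction sketch; the field names and wording are hypothetical, and the actual system feeds audio and video features to a multimodal model rather than plain text.

```python
def build_av_cot_prompt(scene_description: str, ocr_text: str, asr_hypothesis: str) -> str:
    """Illustrative three-step prompt following a Perception -> Reasoning ->
    Transcription chain-of-thought over visual evidence and an acoustic hypothesis."""
    return (
        "Step 1 (Perception): Describe the visual evidence.\n"
        f"- Scene: {scene_description}\n"
        f"- On-screen text: {ocr_text}\n"
        "Step 2 (Reasoning): Explain how this evidence constrains ambiguous words "
        "in the acoustic hypothesis, citing the evidence explicitly.\n"
        f"- Acoustic hypothesis: {asr_hypothesis}\n"
        "Step 3 (Transcription): Output the corrected transcript only."
    )

print(build_av_cot_prompt("a chemistry lecture with a slide on titration",
                          "NaOH 0.1 mol/L", "add the sodium hydro side slowly"))
```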
The experiments are thorough, demonstrating the effectiveness of the VASR model against several strong baselines. The use of character error rate (CER) as a metric is appropriate for the task, and the results indicate a significant performance improvement over existing models. The ablation studies provide additional insights into the importance of the AV-CoT mechanism, reinforcing the claims made about its effectiveness in mitigating single-modality dominance.
The authors provide sufficient implementation details, including the model architecture, training parameters, and data processing pipeline. However, the reproducibility could be enhanced by providing more detailed descriptions of the datasets used and ensuring that all code and data are readily accessible for independent verification.
One notable limitation is the reliance on the Qwen2.5-Omni model, which has a low frame rate for visual encoding, potentially impacting the performance of the lip-reading task. Additionally, the paper does not address the potential biases that may arise from the datasets used, which could affect the generalizability of the results.
The research has significant implications for improving speech recognition systems, particularly in contexts where visual cues are abundant. This could enhance accessibility for individuals with hearing impairments and improve user experience in various multimedia applications. The open-sourcing of the dataset and code also promotes further research in this area.