Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and queries span multiple intents and answer formats. Existing biomedical audio-language QA systems are typically monolithic, with no specialization mechanisms for tackling diverse respiratory corpora and query intents. They have also been validated only in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world deployment. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
Primary: Tsinghua University
All Institutions: Tsinghua University, University of Calabria, University of Cambridge
The main contribution of this paper is the introduction of RAMoEA-QA, a hierarchically specialized model for respiratory audio question answering that effectively addresses the challenges posed by heterogeneous audio data and diverse query intents. This work represents a meaningful advancement in the integration of machine learning and healthcare, demonstrating both innovative methodology and impactful results.
The proposed RAMoEA-QA model introduces a novel hierarchical specialization approach that employs a two-stage conditional specialization mechanism, utilizing an Audio Mixture-of-Experts and a Language Mixture-of-Adapters. This design allows the model to effectively handle the diverse nature of respiratory audio data and various query intents, which is a significant advancement over existing monolithic biomedical audio-language QA systems. The use of pre-trained audio encoders and LoRA adapters on a frozen LLM demonstrates a thoughtful integration of state-of-the-art techniques while maintaining a low parameter overhead.
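As a rough illustration of the two-stage conditional specialization described above, the following sketch hard-selects one audio encoder per recording and one LoRA adapter per query. The expert and adapter names and the top-1 gating are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of per-example routing: stage 1 picks an audio
# encoder for the recording, stage 2 picks a LoRA adapter for the query.
AUDIO_EXPERTS = ["encoder_cough", "encoder_breath", "encoder_generic"]
LORA_ADAPTERS = ["adapter_binary", "adapter_multiclass", "adapter_regression"]

def route(audio_gate_scores, query_gate_scores):
    """Both stages here are hard top-1 selections over gate scores
    (e.g. logits produced by small gating networks)."""
    expert = AUDIO_EXPERTS[max(range(len(audio_gate_scores)),
                               key=audio_gate_scores.__getitem__)]
    adapter = LORA_ADAPTERS[max(range(len(query_gate_scores)),
                                key=query_gate_scores.__getitem__)]
    return expert, adapter
```

A real system would apply the chosen encoder to the waveform and activate only the chosen adapter on the frozen LLM, which is what keeps the parameter overhead small.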
The paper presents a comprehensive experimental setup, comparing RAMoEA-QA against strong baselines and conducting ablation studies to validate the effectiveness of the routing mechanisms. The reported in-domain test accuracy of 0.72 significantly surpasses the state-of-the-art baselines (0.61 and 0.67), indicating robust performance. The experiments also address generalization across different domains, modalities, and tasks, which is critical for real-world applications in healthcare.
The authors provide a link to their code repository, which is essential for reproducibility. However, the paper could benefit from additional details regarding the implementation specifics, such as hyperparameter settings and training procedures, to facilitate easier replication of results by other researchers.
One limitation noted is the reliance on the RA-QA collection, which may not encompass the full diversity of respiratory audio data encountered in practice. Additionally, while the model shows strong performance in controlled settings, its robustness in highly variable real-world environments remains to be fully validated.
The RAMoEA-QA model has significant potential applications in healthcare, particularly in respiratory care, where it can enhance patient monitoring and screening through scalable audio analysis. Its ability to handle diverse audio inputs and question formats could lead to more effective and personalized patient interactions, ultimately improving healthcare outcomes.
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics--speaker attributes, phonetic content, and semantic meaning--and analyze how balancing relevance and diversity along these axes during data selection affects downstream ASR performance. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can exceed the performance of models trained on the full dataset by up to 36.8% relative WER reduction on target domains.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
This paper contributes to the field by proposing a robust embedding-based data selection method for ASR systems that addresses domain mismatch challenges, demonstrating significant performance improvements across various datasets. The comprehensive methodology and experimental validation provide a strong foundation for future research in data selection and ASR model training.
The paper presents a novel approach to data selection for ASR systems using embedding-based methods that capture speaker, phonetic, and semantic characteristics. The use of Maximal Marginal Relevance (MMR) to balance relevance and diversity in data selection is a significant methodological advancement. The multi-embedding and multi-target strategies enhance the robustness of the approach, allowing for effective training on large-scale, heterogeneous datasets. The methodology is well-structured, with clear definitions and mathematical formulations that enhance clarity and reproducibility.
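The relevance-diversity trade-off that MMR formalizes can be sketched as a greedy selection over precomputed similarities. The lambda value and the single-embedding setup below are illustrative simplifications of the paper's multi-embedding, multi-target strategy.

```python
# Minimal greedy Maximal Marginal Relevance (MMR) selector.
def mmr_select(relevance, pairwise_sim, k, lam=0.7):
    """relevance[i]: similarity of candidate i to the target domain;
    pairwise_sim[i][j]: similarity between candidates i and j.
    Greedily picks k items, trading off relevance against redundancy
    with already-selected items."""
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam close to 1 the selector reduces to pure nearest-to-target retrieval; lowering lam forces it to spread picks across the embedding space, which is the diversity axis the paper analyzes.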
The experiments are comprehensive, utilizing multiple target datasets (LibriSpeech, CommonVoice, TED-LIUM) to validate the effectiveness of the proposed data selection methods. The results demonstrate substantial improvements in word error rate (WER) when using strategically selected subsets compared to random selections and the full dataset. The experiments are well-designed, with appropriate controls and comparisons that provide strong evidence for the claims made.
The paper provides sufficient details regarding the implementation, including model architectures, training procedures, and data selection algorithms. However, the lack of publicly available code or datasets limits reproducibility. The use of specific embeddings and the complexity of the MMR selection process may pose challenges for others attempting to replicate the results without access to the same resources.
The paper acknowledges the computational expense of the greedy MMR procedure and the potential for label noise in the pseudo-labeled Granary dataset. Additionally, the reliance on embedding-based selection may not generalize across all domains or datasets, and the performance may vary based on the characteristics of the target domain.
The findings have significant implications for the deployment of ASR systems in specialized domains, particularly in scenarios where labeled data is scarce. The ability to effectively select relevant training data can enhance the performance of models in real-world applications, making this research highly relevant to both academia and industry. The approach may also inspire further research into data selection strategies in other machine learning domains.
Generative audio requires fine-grained controllable outputs, yet most existing methods either require model retraining for specific controls or rely on inference-time controls (\textit{e.g.}, guidance) that can themselves be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to decoder backpropagation, we introduce a guidance-based approach built on selective training-free guidance (TFG) and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step and requiring minimal training resources (7M parameters and $\approx$ 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations thereof) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
Primary: UC San Diego
All Institutions: UC San Diego
The main contribution of this paper is the introduction of a low-resource, inference-time control framework for latent audio diffusion models, which effectively balances control precision, audio fidelity, and runtime performance. The methodology and results presented are significant advancements in the field of controllable audio generation, showcasing the potential for efficient and high-quality audio synthesis.
The paper introduces a novel approach to controllable audio generation through the use of Latent-Control Heads (LatCHs) and selective Training-Free Guidance (TFG). By operating directly in latent space, the proposed method significantly reduces computational overhead associated with traditional end-to-end guidance methods. The methodology is well-structured, with clear explanations of how LatCHs function and the rationale behind selective TFG. The authors provide a solid theoretical foundation, linking their work to existing literature while clearly delineating their contributions.
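To make the latent-space guidance idea concrete, here is a toy single-step update: a linear stand-in for a control head predicts a scalar control from the latent, and the latent is nudged by the analytic gradient of a squared-error loss, with no decoder in the loop. The linear head, the shapes, and the step size are all assumptions for illustration, not the paper's implementation.

```python
# Toy guided update in latent space: nudge the latent so a lightweight
# control head's prediction moves toward a target control value,
# without ever decoding to a waveform.
def guided_step(latent, head_weights, target, step=0.1):
    """latent, head_weights: equal-length lists of floats.
    Loss = (head(latent) - target)^2; the gradient w.r.t. the latent
    is 2*(pred - target)*head_weights, applied as a descent step."""
    pred = sum(z * w for z, w in zip(latent, head_weights))
    grad = [2.0 * (pred - target) * w for w in head_weights]
    return [z - step * g for z, g in zip(latent, grad)]
```

The cost savings the paper reports come from exactly this shape of computation: the gradient is taken through a tiny head rather than backpropagated through the full audio decoder at every step.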
The experiments are comprehensive, utilizing the Stable Audio Open (SAO) dataset and comparing the proposed methods against established baselines, including end-to-end guidance and readouts. The evaluation metrics are well-defined, including both qualitative assessments (mean opinion scores) and quantitative metrics (FDopenl3, KLpasst, and CLAP). The results demonstrate that LatCHs outperform traditional methods in terms of both audio quality and computational efficiency, which is a significant achievement in the field of audio generation.
The paper provides sufficient details regarding the experimental setup, including hyperparameters and training procedures for LatCHs. However, the lack of a publicly available code repository may hinder full reproducibility. The authors do mention the datasets used, which aids in replicating the experiments, but the absence of a project URL limits access to the implementation.
One limitation is the potential challenge in generalizing the method to more complex audio generation tasks beyond the evaluated controls (intensity, pitch, and beats). Additionally, the reliance on specific feature extractors may limit the applicability of the approach to other audio domains. The authors also note that controls with greater variability, such as pitch, pose challenges, indicating room for improvement in handling such cases.
The proposed framework has significant implications for the field of generative audio, particularly in applications requiring real-time audio manipulation and control. The ability to generate high-quality audio with low computational costs can benefit various industries, including music production, gaming, and virtual reality. Furthermore, the approach could pave the way for more accessible audio generation tools for creators without extensive computational resources.
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
Primary: Universitat Politècnica de Catalunya (UPC)
All Institutions: Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC)
The main contribution of this paper is the introduction of a zero-AV-resource AVSR framework that utilizes synthetic visual data to enhance speech recognition capabilities in under-resourced languages. This innovative approach not only addresses a critical gap in the field but also opens avenues for future research and development in multimodal speech recognition.
The proposed methodology leverages synthetic visual data generated from static images to create a training framework for AVSR in zero-resource scenarios. The use of lip-syncing techniques to generate talking-head videos is innovative, particularly in the context of under-resourced languages like Catalan. The end-to-end pipeline for generating synthetic audiovisual data is well-structured and language-agnostic, which enhances the applicability of the approach. The integration of a semi-automatic annotation pipeline further strengthens the methodology by providing a means to evaluate the model effectively. However, the reliance on synthetic data may raise questions about the generalizability of the results to real-world applications.
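The data-generation loop described above might look roughly like the following, where `lip_sync` is a hypothetical stand-in for whatever talking-head generator is used (the review does not name one), and the round-robin reuse of static faces is an assumption.

```python
# Illustrative outline of the zero-AV-resource data pipeline: pair each
# real audio clip with a static face, synthesize a talking-head video,
# and collect (audio, video) examples for AV-HuBERT fine-tuning.
def build_synthetic_corpus(audio_clips, face_images, lip_sync):
    """lip_sync(face, audio) -> synthetic video is a placeholder for
    the actual generator; faces are reused round-robin across clips."""
    corpus = []
    for i, audio in enumerate(audio_clips):
        face = face_images[i % len(face_images)]
        video = lip_sync(face, audio)
        corpus.append({"audio": audio, "video": video})
    return corpus
```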
The experiments conducted are thorough, comparing the proposed model against both audio-only baselines and state-of-the-art ASR systems. The results demonstrate significant improvements in transcription accuracy when using synthetic visual data, particularly in challenging noise conditions. The authors provide clear metrics (WER) to quantify performance, and the comparative analysis with existing models like Whisper adds depth to the evaluation. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of various components of the model.
The paper includes a link to the GitHub repository containing the code and resources for synthetic data generation and annotation, which is a positive aspect for reproducibility. However, the details regarding the datasets and specific configurations used in the experiments could be more explicitly stated to facilitate replication by other researchers.
One limitation is the potential gap between synthetic and real-world data, as the synthetic videos may not fully capture the complexities of natural speech and visual cues. Additionally, while the model shows promise for Catalan, its performance on other under-resourced languages remains untested. The reliance on a single method for generating synthetic videos may also limit the robustness of the approach.
This research has the potential to significantly impact the field of speech recognition, particularly for under-resourced languages, by providing a scalable method for training AVSR systems without the need for extensive audiovisual datasets. The implications extend to various applications in accessibility, communication technologies, and language preservation.
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves 27x speedup over the AR baseline, making it suitable for real-time applications.
Primary: IBM Research
All Institutions: IBM Research
The main contribution of this paper is the introduction of a non-autoregressive LLM-based ASR system that effectively combines the strengths of pretrained speech encoders and language models through a novel editing approach, significantly improving transcription speed and maintaining competitive accuracy. The methodology is innovative, and the experimental results demonstrate substantial technical impact, making it a valuable contribution to the field of machine learning and speech recognition.
The proposed methodology introduces a non-autoregressive (NAR) approach to automatic speech recognition (ASR) by framing it as conditional transcript editing. This is achieved through a bidirectional LLM editor that refines an initial hypothesis generated by a pretrained speech encoder. The interleaved padding strategy is a notable innovation, allowing the model to focus on corrections rather than full reconstructions, which enhances the efficiency of the editing process. The use of lightweight LoRA adapters for model adaptation is also a significant methodological contribution, enabling the model to leverage pretrained linguistic knowledge effectively while maintaining a manageable number of trainable parameters.
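A toy version of the interleaved padding idea: pad tokens are placed between hypothesis tokens so a parallel editor can insert words where needed and otherwise copy its input, exploiting the identity-mapping bias. The one-pad-per-gap scheme and token names below are illustrative, not the paper's exact scheme.

```python
# Interleave pad slots into an initial hypothesis, then collapse the
# editor's output back into a transcript by dropping untouched pads.
PAD = "<pad>"

def interleave(hypothesis, pads_per_gap=1):
    """Insert pads_per_gap pad tokens before, between, and after the
    hypothesis tokens, giving the editor fixed insertion slots."""
    out = [PAD] * pads_per_gap
    for tok in hypothesis:
        out.append(tok)
        out.extend([PAD] * pads_per_gap)
    return out

def collapse(edited):
    """Drop pads the editor left unchanged to recover the transcript."""
    return [tok for tok in edited if tok != PAD]
```

Because most slots stay as pads and most words are simply copied, the editor's job reduces to local corrections, which is what makes fully parallel prediction tractable.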
The experiments conducted are rigorous, with the authors evaluating their model against leading ASR systems on the Open ASR leaderboard. The reported results demonstrate a competitive word error rate (WER) of 5.67% for NLE++, with a substantial speedup of 27x over autoregressive baselines in single-utterance scenarios. The inclusion of ablation studies further strengthens the evaluation, providing insights into the impact of various design choices on performance. However, the paper could benefit from more extensive comparisons with a broader range of models and additional datasets to validate the robustness of the findings.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository or demo URL limits the ability for others to directly replicate the results. The authors mention using specific datasets and configurations, which is helpful, but sharing the implementation would significantly improve reproducibility.
The paper acknowledges that the NLE approach is less flexible than autoregressive models in scenarios requiring substantial changes to the hypothesis. It also highlights potential latency overhead due to the need for retokenization when using different tokenizers for the CTC encoder and the LLM. Moreover, the performance in multilingual settings appears to be weaker, suggesting that the model's training data may not be adequately representative of all languages.
The proposed NLE system has significant implications for real-time ASR applications, particularly in conversational settings where low latency is critical. By enabling faster and more accurate transcription, this approach could enhance user experiences in various domains, including virtual assistants, customer service, and accessibility technologies. The ability to refine initial hypotheses rather than regenerate them from scratch could also lead to more efficient use of computational resources.
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses show that SALM-Duplex leaks more strongly in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Huawei Leibniz Research Center, Nanyang Technological University, The Hong Kong Polytechnic University
The paper effectively characterizes speaker identity leakage in full-duplex speech dialogue models and proposes innovative anonymization techniques that significantly enhance privacy without sacrificing usability. This work is a crucial step towards ensuring the responsible deployment of AI-driven speech technologies.
The paper introduces a novel approach to analyzing speaker identity leakage in end-to-end full-duplex speech dialogue models, specifically SALM-Duplex and Moshi. The authors employ a lazy-informed attacker scenario to assess privacy risks, which is a relevant and timely concern given the increasing use of always-on speech systems. The proposed anonymization techniques, Anon-W2W and Anon-W2F, are well-structured, with clear distinctions between waveform-level and feature-domain methods. The methodology is rigorous, utilizing established metrics like Equal Error Rate (EER) and Linkability to quantify privacy improvements.
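For readers unfamiliar with the metric, EER can be computed from attacker similarity scores with a simple threshold sweep; the sketch below is a generic implementation, not the VoicePrivacy toolkit's.

```python
# Equal error rate from attacker scores (higher = "same speaker"):
# sweep thresholds until the false-accept rate on impostor trials
# matches the false-reject rate on genuine trials.
def eer(genuine, impostor):
    best = (2.0, 0.5)  # (|FAR - FRR|, EER candidate)
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]
```

An EER near 0% means the attacker reliably links speakers (poor privacy); an EER near 50% means the attacker does no better than chance, which is why the paper treats 50% as the ceiling.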
The experiments are comprehensive, employing a standardized dataset from the VoicePrivacy 2024 Challenge and a well-defined evaluation protocol. The results demonstrate significant improvements in privacy metrics, particularly with the Anon-W2F method, which achieves a notable increase in EER, indicating strong privacy protection. The authors also provide a thorough analysis of the impact of anonymization on dialogue quality and efficiency, showcasing a balanced consideration of privacy and usability.
The paper includes sufficient details regarding the experimental setup, including model architectures, training datasets, and evaluation metrics, which should facilitate reproducibility. However, the reliance on specific datasets and the proprietary nature of some components may pose challenges for full replication.
The study primarily focuses on two specific models (SALM-Duplex and Moshi), which may limit the generalizability of the findings to other full-duplex systems. Additionally, while the proposed anonymization methods show promise, the impact on speech quality and naturalness remains an area for further exploration. The authors also acknowledge that their quality metrics may not fully capture speech-level attributes.
The implications of this research are significant, particularly in the context of privacy regulations like GDPR. By addressing the privacy risks associated with always-on speech systems, the work contributes to the development of safer AI technologies that can be deployed in real-world applications without compromising user privacy. The findings could influence future designs of speech dialogue systems, emphasizing the need for privacy-by-design principles.
Although deep neural networks have driven significant progress in neural vocoders in recent years, these models usually suffer from intrinsic challenges such as opaque modeling, inflexible retraining under different input configurations, and a parameter-performance trade-off. These inherent hurdles can heavily impede the development of this field. To resolve these problems, in this paper, we propose a novel neural vocoder in the time-frequency (T-F) domain. Specifically, we bridge the classical range-null decomposition (RND) theory and the vocoder task, where the reconstruction of the target spectrogram is formulated as the superposition of range-space and null-space components. The former projects the representation in the original mel domain into the target linear-scale domain, and the latter is instantiated via neural networks to further infill the spectral details. To fully leverage the spectrum prior, an elaborate dual-path framework is devised, where the spectrum is hierarchically encoded and decoded, and cross- and narrow-band modules are leveraged for effective modeling along the sub-band and time dimensions. To enable inference under various configurations, we propose a simple yet effective strategy that transforms multi-condition adaptation at the inference stage into data augmentation at the training stage. Comprehensive experiments are conducted on various benchmarks. Quantitative and qualitative results show that while enjoying a lightweight network structure and a scalable inference paradigm, the proposed framework achieves state-of-the-art performance among existing advanced methods. Code is available at https://github.com/Andong-Li-speech/RNDVoC.
Primary: Institute of Acoustics, Chinese Academy of Sciences
All Institutions: Institute of Acoustics, Chinese Academy of Sciences, Chongqing University of Posts and Telecommunications, Tencent AI Lab, University of Chinese Academy of Sciences
This paper makes a significant contribution to the field of neural vocoding by introducing a novel architecture that effectively utilizes range-null space decomposition, enhancing both the interpretability and performance of audio synthesis models. The methodology is well-structured, and the experimental results substantiate its effectiveness, positioning it as a valuable advancement in the audio processing domain.
The paper introduces a novel neural vocoder architecture based on range-null space decomposition (RND), which effectively separates the reconstruction of audio spectrograms into two orthogonal components: range-space and null-space. This approach is innovative as it leverages classical signal processing theory to enhance the interpretability and robustness of neural vocoders. The dual-path framework proposed allows for hierarchical encoding and decoding of spectral features, which is a significant advancement over existing methods that typically use full-band modules. The introduction of a multi-condition-as-data-augmentation strategy is also noteworthy, as it allows for scalable inference without the need for retraining, addressing a common limitation in neural vocoders.
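The range-null split described above is conventionally written as follows; the notation here is the standard RND form and is assumed, since the review does not reproduce the paper's equations.

```latex
% Mel spectrogram y relates to the target linear spectrogram x via a
% known mel filterbank A, i.e. y = A x. Any consistent reconstruction
% decomposes as
\hat{x} = \underbrace{A^{+} y}_{\text{range space}}
        + \underbrace{\left(I - A^{+} A\right) f_{\theta}(y)}_{\text{null space}},
% where A^{+} is the Moore--Penrose pseudo-inverse and f_{\theta} is the
% neural network that infills spectral detail. Since A A^{+} y = y for
% y in the range of A and A (I - A^{+} A) = 0, the constraint
% A \hat{x} = y holds by construction, regardless of f_{\theta}.
```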
The authors conducted comprehensive experiments on established benchmarks, including LJSpeech and LibriTTS, demonstrating state-of-the-art performance compared to existing methods. The quantitative metrics and qualitative assessments indicate that the proposed method not only achieves high-quality audio synthesis but also maintains a lightweight network structure, enhancing its practical applicability. The ablation studies further validate the effectiveness of the proposed components, providing a thorough evaluation of their contributions to performance.
The paper provides a GitHub repository link for code access, which is crucial for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and training configurations, could be better documented to facilitate easier replication of results by other researchers.
While the proposed method shows promise, it may still face challenges in handling extreme variations in input conditions that were not covered in the training data. Additionally, the reliance on the pseudo-inverse operation might introduce computational overhead in real-time applications, which could limit its deployment in resource-constrained environments.
The advancements in neural vocoding presented in this paper have significant implications for various audio processing applications, including text-to-speech synthesis, music generation, and speech enhancement. By improving the quality and efficiency of vocoders, this work could enhance user experiences in voice interfaces and multimedia applications, contributing to the broader field of artificial intelligence in audio processing.
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0$\times$ latency reduction with a cache of only ${\sim}$1K entries while preserving or improving perceptual quality.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
SoundWeaver introduces a novel approach to accelerating text-to-audio diffusion models through semantic warm-starting, demonstrating substantial improvements in latency and quality. The comprehensive methodology and experimental validation position this work as a meaningful contribution to the field of machine learning in audio generation.
The methodology presented in SoundWeaver is innovative, focusing on warm-starting text-to-audio diffusion models by leveraging semantically similar cached audio. The system comprises three main components: a Reference Selector for retrieving and aligning cached audio, a Skip Gater for determining the number of NFEs to skip, and a Cache Manager for maintaining cache quality. The use of a contextual multi-armed bandit approach for the Skip Gater is particularly noteworthy, as it adapts to varying user prompts and optimizes performance dynamically. The integration of semantic and duration-aware retrieval mechanisms adds depth to the approach, allowing for more efficient audio generation while preserving quality.
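As a rough illustration of the retrieval step (not the authors' implementation — the gating thresholds, embedding vectors, and cache entry layout below are all assumptions), semantic and duration-aware candidate selection might look like:

```python
import numpy as np

def select_reference(prompt_emb, target_dur, cache, sim_thresh=0.8, dur_tol=0.25):
    """Return the cached entry with the highest cosine similarity to the
    prompt embedding, subject to a similarity floor and a relative duration
    tolerance; return None on a cache miss. Each cache entry is assumed to
    be a dict with 'emb', 'dur', and 'latent' keys."""
    best, best_sim = None, sim_thresh
    for entry in cache:
        sim = float(prompt_emb @ entry["emb"] /
                    (np.linalg.norm(prompt_emb) * np.linalg.norm(entry["emb"]) + 1e-12))
        # duration-aware gate: warm-starting from a clip of very different
        # length would require heavy temporal alignment, so reject it here
        if sim >= best_sim and abs(entry["dur"] - target_dur) / target_dur <= dur_tol:
            best, best_sim = entry, sim
    return best
```

A selected entry's latent would then seed the diffusion process partway through, with the Skip Gater deciding how many early NFEs to skip.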
The experimental evaluation is robust, utilizing real-world audio traces and a variety of metrics to assess performance. The results demonstrate significant latency reductions (1.8-3.0x) while maintaining or improving perceptual quality across different models. The ablation studies effectively illustrate the contributions of each component, reinforcing the importance of the proposed methods. However, the reliance on specific datasets and the absence of extensive user studies could limit the generalizability of the findings.
The paper provides a detailed description of the experimental setup, including the models used, metrics evaluated, and the caching mechanism. However, the lack of a publicly accessible code repository or demo limits reproducibility. The authors mention using generative AI for writing and evaluation, which raises questions about the transparency of the evaluation process.
The paper acknowledges limitations such as potential phase vocoder distortion on longer audio requests and the lack of dedicated request schedulers. Additionally, the system's performance with complex samplers remains untested, which could impact its applicability in diverse scenarios.
SoundWeaver has significant implications for real-time audio generation applications, such as music composition and sound design. By reducing latency and improving throughput, it can enhance user experience in various audio-related services. The model-agnostic nature of the approach also suggests potential for broader adoption across different diffusion models and applications.
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
WhispEar presents a novel bidirectional framework for whispered speech conversion, effectively addressing data scarcity through innovative pseudo-parallel data generation. The paper's contributions significantly advance the field of speech processing, particularly in enhancing the intelligibility and naturalness of whispered speech.
The methodology presented in WhispEar is innovative, leveraging a bidirectional framework that allows for both whisper-to-normal (W2N) and normal-to-whisper (N2W) conversions. The use of semantic representations to bridge the gap between the two modalities is a significant advancement. The three-stage training process, particularly the zero-shot pseudo-parallel whisper generation, is a clever approach to mitigate the scarcity of parallel data. The incorporation of a lightweight semantic tokenizer and a shared Flow-Matching Transformer model demonstrates a solid understanding of the underlying acoustic characteristics and the need for efficient data utilization.
The experiments are well-structured, comparing WhispEar against strong baselines and demonstrating clear performance improvements across various metrics, including intelligibility, naturalness, and prosody recovery. The release of the wEar dataset, the largest bilingual whispered-normal parallel corpus, adds significant value to the research community. The systematic scaling study provides compelling evidence of the effectiveness of the proposed methods, showcasing how increasing the amount of pseudo-parallel data leads to consistent performance gains.
The paper provides sufficient details regarding the training process, data collection, and evaluation metrics, which should enable other researchers to replicate the experiments. However, the absence of a publicly available code repository limits full reproducibility, as potential users cannot directly implement the proposed methods without access to the code.
One limitation noted is the reliance on the quality of the generated pseudo-whispered data, which may not fully capture the nuances of real whispered speech. Additionally, while the framework shows promise, its performance in noisy environments or with diverse speaker characteristics has not been thoroughly evaluated. Future work should address these aspects to enhance robustness and generalizability.
The implications of this research are significant, particularly in areas requiring whispered speech conversion for privacy and communication enhancement. The ability to generate high-quality whispered speech from normal speech could have applications in assistive technologies, voice restoration, and privacy-focused communication tools. The release of the wEar dataset also paves the way for further research in this domain, potentially leading to advancements in speech synthesis and recognition technologies.
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
Primary: Anyreach AI
All Institutions: Anyreach AI
The main contribution of this paper is the introduction of DualTurn, a model that effectively learns turn-taking dynamics in conversational audio through generative pretraining, outperforming existing methods in both anticipation of turn boundaries and prediction of agent actions. This work represents a meaningful advancement in the field of conversational AI, addressing limitations in current models and providing a foundation for future research in multi-speaker interaction systems.
The methodology presented in DualTurn is innovative, leveraging dual-channel generative pretraining to learn turn-taking dynamics without labeled data. The use of a lightweight neural codec for audio encoding, combined with a two-stage training process, allows the model to effectively capture conversational context and predict turn-taking signals. The architecture is well thought out, with a clear distinction between generative pretraining and subsequent fine-tuning for specific tasks, which enhances the model's performance in predicting agent actions.
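The dual-channel generative setup can be sketched at the token level; the actual codec frame structure and any special tokens are not specified in the review, so the simple frame-wise interleaving below is an assumption about how one LM could model both speakers jointly:

```python
def interleave_channels(ch_a, ch_b):
    """Merge two speakers' per-frame codec tokens into one autoregressive
    stream [a0, b0, a1, b1, ...], so that predicting the next token requires
    modeling the joint turn-taking dynamics of both channels."""
    assert len(ch_a) == len(ch_b), "channels must be frame-aligned"
    stream = []
    for a, b in zip(ch_a, ch_b):
        stream += [a, b]
    return stream

def deinterleave_channels(stream):
    """Inverse of interleave_channels: recover the two per-speaker streams."""
    return stream[0::2], stream[1::2]
```

Under this layout, next-token prediction on the merged stream forces the model to anticipate when the other channel becomes active, which is exactly the signal later fine-tuned into explicit turn-taking actions.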
The experimental evaluation is robust, utilizing standard benchmarks such as Switchboard and otoSpeech to compare DualTurn against existing models like VAP and a large audio-text fusion model. The results demonstrate significant improvements in both word-level turn prediction and agent action prediction, with clear metrics provided (e.g., wF1 and AUC scores). The ablation studies further validate the contributions of different components of the model, showcasing the effectiveness of the generative pretraining stage.
The paper provides sufficient details about the architecture, training procedures, and datasets used, which supports reproducibility. However, the absence of URLs for code or demo implementations limits the ability for others to directly replicate the results. Including a public repository would enhance reproducibility significantly.
One limitation noted is the reliance on a single language (English) and a relatively small dataset (453 hours of dual-channel conversation audio), which may affect the generalizability of the model to other languages or larger, more diverse datasets. Additionally, while the model anticipates turn boundaries earlier, the practical implications of this in real-world applications need further exploration.
The implications of DualTurn are significant for applications in conversational AI, particularly in enhancing the naturalness of interactions in voice assistants and other automated systems. By improving turn-taking dynamics, the model can contribute to more fluid and human-like conversations, which is critical for user satisfaction and engagement in AI-driven communication tools.
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it with a two-step local-global scheme driven by an evolution strategy. ESC preserves performance unchanged under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, limiting relative accuracy degradation to 1% on the AST model.
Primary: cortAIx Labs
All Institutions: cortAIx Labs
The paper presents a novel calibration method for low-bit quantization of speech models that leverages evolution strategies to optimize activation scaling, demonstrating significant performance improvements across various tasks. The technical contributions are substantial, addressing a critical gap in the quantization of audio models and paving the way for more efficient deployment in resource-constrained environments.
The proposed Evolution Strategy-Based Calibration (ESC) method is innovative, particularly in its formulation of calibration as a two-step optimization problem that integrates local and global objectives. The use of evolution strategies to optimize activation scaling factors is a novel approach tailored specifically for the audio domain, addressing the unique challenges posed by audio activations that differ significantly from those in vision and NLP. The methodology is well-structured, with clear steps for initialization and optimization, although it could benefit from more detailed explanations of the algorithm's parameters and their tuning.
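To make the idea concrete, below is a minimal single-tensor sketch of evolution-strategy calibration; the paper's two-step local-global scheme is not reproduced, and the MSE objective, population size, and step-size annealing are illustrative assumptions:

```python
import numpy as np

def quantize_dequantize(x, scale, bits=4):
    # symmetric uniform fake-quantization with a given activation scale
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def es_calibrate_scale(activations, bits=4, pop=16, iters=50, sigma=0.1, seed=0):
    """Simple (1+lambda)-style evolution strategy over the log of the
    activation scale, minimizing MSE between raw and fake-quantized values.
    Starts from max-calibration, so it can only match or improve on it."""
    rng = np.random.default_rng(seed)
    qmax = 2 ** (bits - 1) - 1
    log_s = np.log(np.abs(activations).max() / qmax + 1e-12)
    best = np.inf
    for _ in range(iters):
        cand = np.append(log_s + sigma * rng.standard_normal(pop), log_s)  # keep parent
        errs = [np.mean((activations -
                         quantize_dequantize(activations, np.exp(c), bits)) ** 2)
                for c in cand]
        i = int(np.argmin(errs))
        log_s, best = cand[i], errs[i]
        sigma *= 0.95  # anneal mutation strength
    return np.exp(log_s), best
```

Because audio activations can be heavy-tailed, the max-calibrated scale wastes most of the INT4 grid on rare outliers; the ES typically shrinks the scale to trade clipping error against rounding error.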
The experiments conducted are comprehensive, covering multiple speech tasks and models, which strengthens the validity of the results. The paper reports significant improvements over existing calibration methods, particularly in INT4 quantization, which is crucial for deploying models in resource-constrained environments. However, the paper lacks detailed descriptions of datasets and specific evaluation metrics used, which could enhance the reproducibility and understanding of the results.
While the paper outlines the methodology and experimental setup, it does not provide sufficient implementation details or code availability, which are critical for reproducibility. The absence of a project URL or demo further limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a specific hardware configuration (NVIDIA RTX 3090) for performance evaluation, which may not generalize across different platforms. Additionally, while the method shows promise for INT4 quantization, the paper does not explore the trade-offs or potential degradation in performance for other model architectures or tasks outside those tested.
The proposed ESC method has the potential to significantly impact the deployment of speech models in real-world applications, particularly in scenarios where computational resources are limited. By enabling near-lossless performance at lower bit-widths, this work could facilitate the broader adoption of advanced speech processing technologies in mobile and embedded systems.
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, University of California
The main contribution of this paper is the introduction of Trilobyte, a byte-level tokenization schema that enables tractable modeling of 24-bit audio for lossless compression using autoregressive language models. This work significantly advances the application of machine learning in audio compression, addressing a critical gap in the literature and providing a foundation for future research in the area.
The paper introduces a novel byte-level tokenization schema, Trilobyte, which effectively addresses the vocabulary explosion problem in autoregressive language models (LMs) for lossless audio compression. By reducing the vocabulary size from exponential scaling to a constant size, the authors enable tractable modeling of 24-bit audio, a significant advancement over prior work limited to 8-bit audio. The methodology is well-structured, detailing the compression pipeline, the use of arithmetic coding, and the training of models on diverse audio datasets. The approach is theoretically sound and leverages established principles of autoregressive modeling, making it a meaningful contribution to the field.
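A minimal sketch of byte-level audio tokenization in the spirit of Trilobyte (the paper's exact byte ordering and any framing tokens are not described here, so little-endian ordering is an assumption): each b-bit sample becomes b/8 byte tokens, keeping the vocabulary fixed at 256 instead of $O(2^{b})$.

```python
import numpy as np

def bytes_tokenize(samples, bit_depth=16):
    """Split each signed PCM sample into little-endian byte tokens (0-255)."""
    n_bytes = bit_depth // 8
    tokens = []
    for s in np.asarray(samples, dtype=np.int64):
        u = int(s) & ((1 << bit_depth) - 1)  # two's-complement view of the sample
        for i in range(n_bytes):
            tokens.append((u >> (8 * i)) & 0xFF)
    return tokens

def bytes_detokenize(tokens, bit_depth=16):
    """Reassemble byte tokens into signed PCM samples (exact inverse)."""
    n_bytes = bit_depth // 8
    out = []
    for j in range(0, len(tokens), n_bytes):
        u = 0
        for i in range(n_bytes):
            u |= tokens[j + i] << (8 * i)
        if u >= 1 << (bit_depth - 1):  # restore the sign from two's complement
            u -= 1 << bit_depth
        out.append(u)
    return np.array(out, dtype=np.int64)
```

The round trip is lossless by construction, which is the property that lets an LM over these 256-way tokens drive an entropy coder for lossless compression.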
The authors conduct a comprehensive benchmarking of their proposed method across various audio domains (music, speech, bioacoustics) and bit depths (8, 16, 24-bit). The experiments are rigorous, with comparisons to industry-standard codecs like FLAC, and they provide detailed results that highlight the performance of Trilobyte in different scenarios. The evaluation demonstrates that the method consistently outperforms FLAC, yielding state-of-the-art compression at 8-bit and 16-bit, while the gains become more modest at 24-bit.
The authors provide a GitHub repository for the Trilobyte implementation, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and training conditions, to facilitate replication of results by other researchers.
The paper acknowledges that the computational cost of the proposed ML approaches is significantly higher than traditional codecs like FLAC, which may limit their practical deployment in real-world scenarios. Additionally, the modest compression gains at higher bit depths suggest that further optimization is needed to make these methods more competitive.
The work has significant implications for the field of audio compression, particularly in contexts where lossless audio fidelity is critical, such as professional audio production and archival storage. By demonstrating the potential of LMs for lossless audio compression, this research opens avenues for future exploration of machine learning techniques in audio processing.
Recent studies have shown that post-deployment adaptation can improve the robustness of speech enhancement models in unseen noise conditions. However, existing methods often incur prohibitive computational and memory costs, limiting their suitability for on-device deployment. In this work, we investigate model adaptation in realistic settings with dynamic acoustic scene changes and propose a lightweight framework that augments a frozen backbone with low-rank adapters updated via self-supervised training. Experiments on sequential scene evaluations spanning 111 environments across 37 noise types and three signal-to-noise ratio ranges, including the challenging [-8, 0] dB range, show that our method updates fewer than 1% of the base model's parameters while achieving an average 1.51 dB SI-SDR improvement within only 20 updates per scene. Compared to state-of-the-art approaches, our framework achieves competitive or superior perceptual quality with smoother and more stable convergence, demonstrating its practicality for lightweight on-device adaptation of speech enhancement models under real-world acoustic conditions.
Primary: Institute of Neuroinformatics
All Institutions: Institute of Neuroinformatics, University of Zurich, ETH Zurich
The main contribution of this paper is the introduction of a lightweight self-supervised adaptation framework for speech enhancement models that efficiently updates model parameters in real-world acoustic environments. This work represents a significant step toward making advanced speech processing technologies more accessible and practical for on-device applications.
The paper presents a novel self-supervised adaptation framework leveraging low-rank adapters for speech enhancement models. This approach addresses the critical issue of adapting models to dynamic acoustic environments without the need for extensive parameter updates, which is a significant advancement over traditional methods that require fine-tuning a large number of parameters. The methodology is well-structured, clearly outlining the adaptation process and the rationale behind using low-rank adapters. However, the paper could benefit from a more detailed explanation of the self-supervised training process and how pseudo-targets are generated.
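As a sketch of why the parameter overhead stays below 1% (the backbone dimensions, adapter rank, and placement are assumptions, not the paper's configuration), a low-rank adapter on a frozen linear layer looks like:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A of rank r.
    During adaptation only A and B would receive gradient updates."""
    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))   # frozen backbone weight
        self.A = 0.01 * rng.standard_normal((rank, d_in))
        self.B = np.zeros((d_out, rank))              # zero init: adapter starts as a no-op

    def __call__(self, x):
        # effective weight is W + B @ A; at init this equals W exactly
        return x @ (self.W + self.B @ self.A).T

    def trainable_fraction(self):
        n_adapter = self.A.size + self.B.size
        return n_adapter / (self.W.size + n_adapter)
```

For a 512x512 layer at rank 2, the adapter adds 2,048 parameters against 262,144 frozen ones, under 1% of the total, and the zero-initialized B guarantees the adapted model starts from exactly the pretrained behavior.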
The experimental setup is robust, involving evaluations across 111 environments and multiple noise types, which strengthens the validity of the results. The metrics used (PESQ, STOI, and SI-SDR) are appropriate for assessing speech enhancement quality. The results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art approaches while maintaining a significantly lower computational footprint. However, the paper lacks a detailed comparison of the proposed method with other lightweight adaptation techniques beyond RemixIT, which could provide a more comprehensive view of its relative performance.
The paper provides a thorough description of the experimental setup, including model architectures, training procedures, and dataset details, which enhances reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the results directly. Including a link to a repository or providing access to the datasets used would significantly improve reproducibility.
One limitation of the proposed method is its reliance on the quality of the pseudo-targets generated during adaptation. If the initial model is not sufficiently robust, the adaptation may not yield optimal results. Additionally, while the method shows promise for dynamic environments, its performance in highly variable or extreme conditions remains to be tested. The paper also does not address the potential computational overhead associated with the self-supervised training phase.
The proposed lightweight adaptation framework has significant implications for real-world applications, particularly in mobile and edge computing environments where computational resources are limited. By enabling effective on-device adaptation of speech enhancement models, this work could improve accessibility for users of hearing aids and other assistive listening devices in diverse acoustic settings. The approach could also be extended to other domains requiring real-time audio processing, enhancing the practicality of machine learning solutions in everyday applications.
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation. This leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable automated signal analysis, offering the potential to provide clinicians with both objective and quantitative feedback on bowel activity. This study presents an automated pipeline for bowel sound segmentation and classification using the wearable acoustic SonicGuard sensor. BS signals from 83 subjects were recorded with this sensor. Data from 40 subjects were manually annotated by clinical experts and used to train an automatic annotation algorithm, while the remaining subjects were used for further model evaluation. An energy-based event detection algorithm was developed to detect BS events. Detected sound segments were then classified into BS patterns using a pretrained Audio Spectrogram Transformer (AST) model. Model performance was evaluated separately for healthy individuals and patients. The best configuration used two specialized models, one trained on healthy subjects and one on patients, achieving an accuracy of 0.97 and AUROC of 0.98 for the healthy group and an accuracy of 0.96 and AUROC of 0.98 for the patient group. The auto-annotation method reduced manual labeling time by approximately 70%, and expert review showed that less than 12% of automatically detected segments required correction. The proposed automated segmentation and classification system enables quantitative assessment of bowel activity, providing clinicians with an objective diagnostic tool that may improve the diagnosis of gastrointestinal function and support the annotation of large-scale datasets.
Primary: Carl von Ossietzky Universität Oldenburg
All Institutions: Carl von Ossietzky Universität Oldenburg, PIUS Hospital
The main contribution of this paper is the development of an automated pipeline for bowel sound segmentation and classification that integrates advanced machine learning techniques with a wearable acoustic sensor, addressing the challenges of subjective auscultation in clinical practice. The comprehensive methodology and promising results indicate a significant step forward in the objective assessment of gastrointestinal function, with potential implications for clinical diagnostics and research.
The paper presents a comprehensive automated pipeline for bowel sound segmentation and classification, utilizing a wearable acoustic sensor. The methodology is well-structured, combining an energy-based event detection algorithm with advanced deep learning models (Audio Spectrogram Transformer and Wav2Vec). The approach is innovative in its integration of cohort-specific models to account for differences between healthy individuals and patients, which is a significant advancement over previous works that did not consider such variability. The detailed description of the event detection algorithm, including the use of RMS amplitude and energy variations, demonstrates a thoughtful approach to addressing the challenges posed by the heterogeneous nature of bowel sounds.
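A minimal sketch of an energy-based detector of this kind (the paper's exact thresholding on RMS amplitude and energy variation is not reproduced; the median-based threshold, window length, and minimum duration below are assumptions):

```python
import numpy as np

def detect_events(signal, fs, win_ms=10, k=3.0, min_dur_ms=20):
    """Flag frames whose RMS exceeds k times the median frame RMS, merge
    consecutive active frames into events, and drop very short ones.
    Returns a list of (start_sample, end_sample) tuples."""
    win = max(1, int(fs * win_ms / 1000))
    n = len(signal) // win
    frames = signal[: n * win].reshape(n, win)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    active = rms > k * np.median(rms)  # median is robust to the sparse bursts

    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            events.append((start * win, i * win))
            start = None
    if start is not None:
        events.append((start * win, n * win))

    min_len = fs * min_dur_ms / 1000  # discard sub-threshold blips
    return [(s, e) for s, e in events if e - s >= min_len]
```

Detected segments would then be cropped and passed to the classifier; the median-relative threshold makes the detector insensitive to overall recording gain.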
The experiments are robust, involving recordings from a diverse set of subjects (both healthy and patients) and a well-defined evaluation protocol. The performance metrics (accuracy and AUROC) indicate strong model performance, particularly with the AST model achieving high accuracy rates (0.97 for healthy subjects and 0.96 for patients). The use of expert-reviewed annotations adds credibility to the evaluation process. However, the paper could benefit from additional comparative analyses with other state-of-the-art methods to further validate the proposed approach.
The authors provide a GitHub repository for the implementation of their approach, which is a positive aspect for reproducibility. However, the paper lacks detailed information on the specific experimental setup, such as hyperparameter tuning and training procedures, which could hinder full reproducibility by other researchers.
The study acknowledges limitations, such as the tendency of the auto-annotation framework to truncate certain event durations, particularly for the MB class. Additionally, the reliance on a relatively small dataset for training and evaluation may affect the generalizability of the model. The authors could also explore the impact of noise and other external factors on the model's performance in real-world clinical settings.
The proposed automated system has significant potential applications in clinical settings, providing objective and quantitative assessments of bowel sounds that could enhance diagnostic accuracy and efficiency. By reducing the workload on clinicians and enabling the analysis of large datasets, this work could facilitate improved patient monitoring and treatment strategies in gastrointestinal care. The development of such tools aligns with the growing trend towards digital health and personalized medicine.
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle this setting, which we term CAVSR (AVSR with rich visual Context), we propose VASR, a model designed to "see" and reason over the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. To address data scarcity, we also construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates single-modality dominance, achieving state-of-the-art performance in CAVSR. The project is open-sourced.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The paper presents a novel approach to context-aware audio-visual speech recognition by leveraging rich visual context through a structured reasoning framework. This work significantly advances the field by addressing the limitations of existing AVSR methods and providing a comprehensive dataset for future research.
The proposed methodology introduces the Audio-Visual Chain-of-Thought (AV-CoT) framework, which is a structured approach to integrate visual context into speech recognition tasks. This is a significant advancement over traditional AVSR methods that primarily focus on lip movements. The three-step process of Perception, Reasoning, and Transcription is well-defined, allowing for a systematic approach to disambiguate speech using multimodal inputs. The authors also address the challenge of data scarcity by developing a scalable data pipeline, which is a commendable effort in enhancing the dataset quality for CAVSR tasks.
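The paper does not publish its exact prompt wording; as a hedged sketch only, the Perception, Reasoning, and Transcription staging could be expressed as a prompt template (the `av_cot_prompt` helper and all instruction strings below are hypothetical):

```python
def av_cot_prompt(asr_hypothesis, visual_notes):
    """Hypothetical three-step Audio-Visual Chain-of-Thought prompt:
    the model first describes what it sees (Perception), then links
    visual evidence to ambiguous audio (Reasoning), and only then
    commits to a transcription (Transcription)."""
    steps = [
        ("Perception", f"Describe the scene and any on-screen text: {visual_notes}"),
        ("Reasoning", f"Given that context, which words in '{asr_hypothesis}' "
                      "are ambiguous, and what does the visual evidence suggest?"),
        ("Transcription", "Output the corrected transcript only."),
    ]
    return "\n".join(f"[{name}] {instr}" for name, instr in steps)

prompt = av_cot_prompt("by the c", "a beach scene; sign reads 'SEA VIEW'")
print(prompt)
```

Forcing the model through the first two stages before transcription is what grounds the final output in both modalities, rather than letting either one dominate.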
The experiments are thorough, demonstrating the effectiveness of the VASR model against several strong baselines. The use of character error rate (CER) as a metric is appropriate for the task, and the results indicate a significant performance improvement over existing models. The ablation studies provide additional insights into the importance of the AV-CoT mechanism, reinforcing the claims made about its effectiveness in mitigating single-modality dominance.
The authors provide sufficient implementation details, including the model architecture, training parameters, and data processing pipeline. However, the reproducibility could be enhanced by providing more detailed descriptions of the datasets used and ensuring that all code and data are readily accessible for independent verification.
One notable limitation is the reliance on the Qwen2.5-Omni model, which has a low frame rate for visual encoding, potentially impacting the performance of the lip-reading task. Additionally, the paper does not address the potential biases that may arise from the datasets used, which could affect the generalizability of the results.
The research has significant implications for improving speech recognition systems, particularly in contexts where visual cues are abundant. This could enhance accessibility for individuals with hearing impairments and improve user experience in various multimedia applications. The open-sourcing of the dataset and code also promotes further research in this area.
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also validated only in limited settings, leaving it unclear how reliably they handle the distribution shifts encountered in real-world deployment. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
Primary: Tsinghua University
All Institutions: Tsinghua University, University of Calabria, University of Cambridge
The main contribution of this paper is the introduction of RAMoEA-QA, a hierarchically specialized model for respiratory audio question answering that effectively addresses the challenges posed by heterogeneous audio data and diverse query intents. This work represents a meaningful advancement in the integration of machine learning and healthcare, demonstrating both innovative methodology and impactful results.
The proposed RAMoEA-QA model introduces a novel hierarchical specialization approach that employs a two-stage conditional specialization mechanism, utilizing an Audio Mixture-of-Experts and a Language Mixture-of-Adapters. This design allows the model to effectively handle the diverse nature of respiratory audio data and various query intents, which is a significant advancement over existing monolithic biomedical audio-language QA systems. The use of pre-trained audio encoders and LoRA adapters on a frozen LLM demonstrates a thoughtful integration of state-of-the-art techniques while maintaining a low parameter overhead.
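The paper's routers are learned jointly with the rest of the model; purely as an illustration of the two-stage conditional specialization idea, top-1 gated routing over two expert pools can be sketched as follows (all expert/adapter names and the gating matrices below are invented, not the authors' components):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical expert pools: audio encoders and LoRA adapters.
AUDIO_EXPERTS = ["cough_encoder", "breath_encoder", "speech_encoder"]
LANG_ADAPTERS = ["mcq_adapter", "regression_adapter", "open_ended_adapter"]

def route(features, gate_weights, names):
    """Pick the expert with the highest gating score (top-1 routing)."""
    scores = softmax(gate_weights @ features)
    return names[int(scores.argmax())], scores

rng = np.random.default_rng(0)
audio_feat = rng.normal(size=16)   # summary embedding of the recording
query_feat = rng.normal(size=16)   # embedding of the question text

W_audio = rng.normal(size=(3, 16))  # gating matrices (random stand-ins;
W_lang = rng.normal(size=(3, 16))   # trained in the actual system)

encoder, _ = route(audio_feat, W_audio, AUDIO_EXPERTS)
adapter, _ = route(query_feat, W_lang, LANG_ADAPTERS)
print(encoder, adapter)
```

The point of the hierarchy is that the two gates condition on different signals: the audio gate sees only the recording, the language gate only the query, so acoustic representation and generation behaviour are specialized independently.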
The paper presents a comprehensive experimental setup, comparing RAMoEA-QA against strong baselines and conducting ablation studies to validate the effectiveness of the routing mechanisms. The reported in-domain test accuracy of 0.72 significantly surpasses the state-of-the-art baselines (0.61 and 0.67), indicating robust performance. The experiments also address generalization across different domains, modalities, and tasks, which is critical for real-world applications in healthcare.
The authors provide a link to their code repository, which is essential for reproducibility. However, the paper could benefit from additional details regarding the implementation specifics, such as hyperparameter settings and training procedures, to facilitate easier replication of results by other researchers.
One limitation noted is the reliance on the RA-QA collection, which may not encompass the full diversity of respiratory audio data encountered in practice. Additionally, while the model shows strong performance in controlled settings, its robustness in highly variable real-world environments remains to be fully validated.
The RAMoEA-QA model has significant potential applications in healthcare, particularly in respiratory care, where it can enhance patient monitoring and screening through scalable audio analysis. Its ability to handle diverse audio inputs and question formats could lead to more effective and personalized patient interactions, ultimately improving healthcare outcomes.
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics (speaker attributes, phonetic content, and semantic meaning) and analyze how relevance and diversity along these axes affect downstream ASR performance when performing data selection. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can exceed the performance of models trained on the full dataset by up to 36.8% relative WER reduction on target domains.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
This paper contributes to the field by proposing a robust embedding-based data selection method for ASR systems that addresses domain mismatch challenges, demonstrating significant performance improvements across various datasets. The comprehensive methodology and experimental validation provide a strong foundation for future research in data selection and ASR model training.
The paper presents a novel approach to data selection for ASR systems using embedding-based methods that capture speaker, phonetic, and semantic characteristics. The use of Maximal Marginal Relevance (MMR) to balance relevance and diversity in data selection is a significant methodological advancement. The multi-embedding and multi-target strategies enhance the robustness of the approach, allowing for effective training on large-scale, heterogeneous datasets. The methodology is well-structured, with clear definitions and mathematical formulations that enhance clarity and reproducibility.
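Greedy MMR selection of the kind described can be sketched over toy embeddings as follows (a simplified illustration under assumed cosine similarity, not the authors' implementation; `lam` trades off relevance against diversity):

```python
import numpy as np

def mmr_select(cands, target, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over embedding rows.

    Picks k candidates balancing relevance to a target embedding
    (weight lam) against similarity to already-selected items."""
    # Cosine similarities via normalized dot products.
    c = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    t = target / np.linalg.norm(target)
    rel = c @ t                      # relevance of each candidate
    sim = c @ c.T                    # pairwise candidate similarity
    selected = []
    remaining = list(range(len(cands)))
    while len(selected) < k and remaining:
        if selected:
            redundancy = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * rel[remaining] - (1 - lam) * redundancy
        best = remaining[int(scores.argmax())]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
pool = rng.normal(size=(50, 8))   # candidate utterance embeddings
target = rng.normal(size=8)       # target-domain centroid
idx = mmr_select(pool, target, k=5)
print(idx)
```

With `lam=1.0` this degenerates to pure relevance ranking; lowering it penalizes picking near-duplicates, which is the diversity axis the paper analyzes. The greedy loop is O(k·n) per similarity lookup, which is the computational expense the authors acknowledge.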
The experiments are comprehensive, utilizing multiple target datasets (LibriSpeech, CommonVoice, TED-LIUM) to validate the effectiveness of the proposed data selection methods. The results demonstrate substantial improvements in word error rate (WER) when using strategically selected subsets compared to random selections and the full dataset. The experiments are well-designed, with appropriate controls and comparisons that provide strong evidence for the claims made.
The paper provides sufficient details regarding the implementation, including model architectures, training procedures, and data selection algorithms. However, the lack of publicly available code or datasets limits reproducibility. The use of specific embeddings and the complexity of the MMR selection process may pose challenges for others attempting to replicate the results without access to the same resources.
The paper acknowledges the computational expense of the greedy MMR procedure and the potential for label noise in the pseudo-labeled Granary dataset. Additionally, the reliance on embedding-based selection may not generalize across all domains or datasets, and the performance may vary based on the characteristics of the target domain.
The findings have significant implications for the deployment of ASR systems in specialized domains, particularly in scenarios where labeled data is scarce. The ability to effectively select relevant training data can enhance the performance of models in real-world applications, making this research highly relevant to both academia and industry. The approach may also inspire further research into data selection strategies in other machine learning domains.
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
Primary: Idiap Research Institute
All Institutions: Idiap Research Institute, Tallinn University of Technology
The main contribution of this paper is the demonstration that compact SSL backbones can achieve competitive performance in audio deepfake detection through careful pre-training strategies, while also introducing a novel method for assessing model calibration under distributional shifts. This work significantly advances the understanding of how SSL pre-training affects model robustness and reliability in practical applications.
The paper introduces RAPTOR, a pairwise-gated hierarchical layer-fusion architecture, to evaluate the performance of compact self-supervised learning (SSL) backbones for audio deepfake detection. The methodology is robust, employing a controlled experimental setup where only the SSL encoder is varied while keeping the downstream detection framework constant. This approach allows for a clear analysis of the impact of different pre-training strategies on model performance. The introduction of test-time augmentation (TTA) for uncertainty estimation is particularly noteworthy, as it provides a novel way to assess model calibration beyond traditional metrics.
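The protocol's perturbation choices and detector belong to the paper; as a rough sketch only, perturbation-based uncertainty from repeated noisy scoring could look like the following (the linear `score_fn` probe is a stand-in for a real detector):

```python
import numpy as np

def tta_uncertainty(audio, score_fn, n_aug=8, noise_std=0.01, seed=0):
    """Perturbation-based test-time augmentation: score n_aug noisy
    copies of the input; the spread of the scores serves as an
    aleatoric-uncertainty proxy, and a large spread under small
    perturbations signals an overconfident, miscalibrated model."""
    rng = np.random.default_rng(seed)
    scores = np.array([
        score_fn(audio + rng.normal(0.0, noise_std, size=audio.shape))
        for _ in range(n_aug)
    ])
    return scores.mean(), scores.std()

# Stand-in "detector": a fixed linear probe squashed to (0, 1).
w = np.linspace(-1, 1, 100)
score_fn = lambda x: 1.0 / (1.0 + np.exp(-(w @ x)))

audio = np.sin(np.linspace(0, 6.28, 100))   # toy waveform
mean, spread = tta_uncertainty(audio, score_fn)
print(round(float(mean), 3), round(float(spread), 4))
```

A model whose mean score stays extreme while its spread collapses under perturbation is exactly the overconfidence pattern that standard EER cannot reveal.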
The authors conduct extensive experiments across 14 cross-domain benchmarks, which is a significant contribution to the field as it highlights the robustness of the proposed models under varying conditions. The results demonstrate that the compact models can achieve competitive performance compared to larger models, which is an important finding for practical applications. The use of multiple evaluation metrics, including EER and pooled EER, adds depth to the analysis and provides a comprehensive view of model performance.
The paper provides sufficient implementation details, including training protocols, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the findings.
One limitation of the study is the reliance on specific datasets for training and evaluation, which may not fully capture the diversity of real-world audio deepfake scenarios. Additionally, the paper acknowledges the need for further investigation into the sensitivity-diversity trade-off observed in the final mHuBERT checkpoint.
The findings of this research have significant implications for the field of audio deepfake detection, particularly in enhancing the reliability of detection systems in real-world applications. The emphasis on model calibration and the effectiveness of compact models could lead to more accessible and efficient solutions for combating audio deepfakes.
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors complicate reliable detection in real-world environments, underscoring the need for representative evaluation benchmarks. To this end, we introduce ML-ITW (Multilingual In-The-Wild), a multilingual dataset covering 14 languages, seven major platforms, and 180 public figures, totaling 28.39 hours of audio. We evaluate three detection paradigms: end-to-end neural models, self-supervised feature-based (SSL) methods, and audio large language models (Audio LLMs). Experimental results reveal significant performance degradation across diverse languages and real-world acoustic conditions, highlighting the limited generalization ability of existing detectors in practical scenarios. The ML-ITW dataset is publicly available.
Primary: Wuhan University
All Institutions: Wuhan University
The main contribution of this work is the introduction of the ML-ITW dataset, which provides a realistic benchmark for evaluating speech deepfake detection systems across multiple languages and platforms. This comprehensive analysis of the technical contribution, methodology, and significance to the field underscores the pressing need for improved detection mechanisms in the face of rapidly advancing speech synthesis technologies.
The paper introduces a novel dataset, ML-ITW, which is a significant advancement in the field of speech deepfake detection. The methodology for dataset construction is robust, utilizing a diverse range of social media platforms and languages, which enhances the realism of the evaluation scenarios. The evaluation of various detection paradigms, including end-to-end models, self-supervised methods, and audio large language models, is comprehensive and well-structured. The approach to validating spoofed samples is methodical, ensuring high-quality data for training and evaluation.
The experiments are thorough, comparing multiple models across different datasets, including ASVspoof2019-LA, ITW, and ML-ITW. The results clearly demonstrate the performance degradation of existing models when faced with real-world conditions, highlighting the limitations of current benchmarks. The use of standard metrics (EER, AUC, ACC, F1) adds rigor to the evaluation, although the paper could benefit from more detailed statistical analysis of the results to strengthen claims about generalization gaps.
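For reference, the EER metric used throughout can be computed from detection scores with a simple threshold sweep (a textbook implementation, independent of the paper's tooling; labels use 1 for bona fide, 0 for spoof, with higher scores meaning more likely bona fide):

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: the operating point where the false-accept
    rate (spoof scored above threshold) meets the false-reject rate
    (bona fide scored below threshold)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))   # closest crossing point
    return (far[i] + frr[i]) / 2.0

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   0,   1,   0,   0,   0]
print(eer(scores, labels))  # one bona fide / spoof overlap -> 0.25
```

Pooled EER, also reported in the paper, applies the same computation to scores concatenated across all evaluation conditions, so a single threshold must serve every domain at once.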
The paper provides sufficient details regarding the dataset construction, model training, and evaluation protocols, which would allow other researchers to replicate the experiments. However, the absence of a direct implementation link or code repository limits the ease of reproducibility.
One notable limitation is the relatively small sample size for some low-resource languages, which may affect the reliability of language-specific analyses. Additionally, while the dataset is comprehensive, the evolving nature of speech synthesis technologies means that the dataset may quickly become outdated, necessitating continuous updates.
The findings of this research have significant implications for the development of robust deepfake detection systems. By highlighting the importance of realistic evaluation benchmarks, the study encourages future research to focus on generalization across diverse conditions, ultimately contributing to the enhancement of security measures against identity impersonation and misinformation.
Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on GitHub: https://github.com/jhcodec843/jhcodec.
Primary: University of Southern California
All Institutions: University of Southern California, Johns Hopkins University
The main contribution of this paper is the introduction of a self-supervised representation reconstruction loss that significantly enhances the performance of neural audio codecs in terms of intelligibility and latency. This work represents a meaningful advancement in the field of audio processing, providing a practical solution for real-time applications while also contributing to the theoretical understanding of codec training methodologies.
The paper introduces a novel self-supervised representation reconstruction (SSRR) loss that improves the training of neural audio codecs. The methodology is well-articulated, detailing how SSRR enhances convergence speed and intelligibility without requiring additional lookahead in streaming architectures. The approach is innovative in its focus on reconstructing self-supervised representations, which is a departure from traditional methods that prioritize mel-spectrogram reconstruction. The use of a single GPU for competitive results is a significant practical consideration that enhances the method's appeal for real-world applications.
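The released JHCodec code defines the actual loss; the core idea of SSRR, matching a frozen SSL encoder's features on source and reconstructed audio, can be sketched with a stand-in feature extractor (the `ssl_features` function below is an invented placeholder for a real SSL model such as HuBERT):

```python
import numpy as np

def ssl_features(audio):
    """Stand-in for a frozen self-supervised encoder: framewise
    statistics over 20-sample frames, purely for illustration."""
    frames = audio.reshape(-1, 20)
    return np.stack([frames.mean(1), frames.std(1),
                     frames.min(1), frames.max(1)], axis=1)

def ssrr_loss(original, reconstructed):
    """Self-supervised representation reconstruction loss: L1
    distance between SSL features of the source audio and the
    codec's reconstruction, encouraging content preservation in
    the decoded waveform, not just in the encoded tokens."""
    return np.abs(ssl_features(original) - ssl_features(reconstructed)).mean()

rng = np.random.default_rng(2)
src = np.sin(np.linspace(0.0, 12.0, 200))
good = src + rng.normal(0, 0.01, size=200)   # faithful reconstruction
bad = rng.normal(0, 1.0, size=200)           # content destroyed
print(float(ssrr_loss(src, good)), float(ssrr_loss(src, bad)))
```

Because the loss is computed on codec *outputs* rather than encoder states, it directly penalizes reconstructions that sound plausible spectrally but lose linguistic content, which is the gap the paper identifies in encoder-only distillation.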
The experiments conducted are robust, demonstrating the effectiveness of SSRR through comparative analysis with existing methods. The results indicate that the proposed JHCodec achieves state-of-the-art performance, particularly in terms of intelligibility and latency, which are critical metrics in audio codec applications. However, specific details regarding the datasets used and the metrics for evaluation could be elaborated further to strengthen the experimental validation.
The authors have taken steps to ensure reproducibility by open-sourcing the full implementation and training pipeline, which is commendable. The availability of a demo on GitHub allows for practical testing of the proposed system, although the paper could benefit from a more detailed description of the training process and hyperparameters used.
One limitation noted is the reliance on self-supervised representations, which may not generalize well across all types of audio content. Additionally, while the zero-lookahead architecture is advantageous for real-time applications, it may impose constraints on the complexity of the audio being processed. The paper could also discuss potential trade-offs between intelligibility and other audio quality metrics, such as fidelity.
The implications of this work are significant for applications in real-time audio processing, such as telecommunication and streaming services. By achieving high intelligibility with low latency, the proposed codec could enhance user experiences in various audio-related fields. Furthermore, the open-source nature of the project encourages further research and development in neural audio codecs, potentially leading to broader advancements in the field.
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7%->49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: https://anonymous3842031239.github.io/
Primary: Institute for Infocomm Research (I2R)
All Institutions: Institute for Infocomm Research (I2R), Nanyang Technological University, The Hong Kong Polytechnic University
The main contribution of this paper is the introduction of a novel supervised finetuning approach combined with frame-level emotion distillation for emotion-preserving streaming speaker anonymization, which significantly improves emotion retention while maintaining privacy and intelligibility. The technical contributions and rigorous methodology present a meaningful advancement in the field of audio processing and speaker anonymization.
The methodology presented in this paper is innovative, focusing on supervised finetuning with neutral-emotion utterance pairs and frame-level emotion distillation. This dual approach effectively addresses the limitations of existing neural audio codec models in preserving emotional content during speaker anonymization. The use of neutral-emotion pairs ensures that the model learns to generate emotional outputs without relying on emotional prompts, which can be difficult to obtain. The design choice to apply emotion distillation to the acoustic branch rather than the semantic branch is a significant improvement that allows for cleaner gradient flow and better emotion preservation.
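As an illustration only (the projection probe and teacher posteriors below are invented stand-ins), frame-level emotion distillation on hidden states can be written as a per-frame cross-entropy against a frozen speech-emotion-recognition teacher:

```python
import numpy as np

def emotion_distill_loss(student_hidden, teacher_probs, proj):
    """Frame-level distillation: project student hidden states to
    emotion logits and match the teacher's per-frame emotion
    distribution via cross-entropy (equals KL up to the teacher's
    entropy, which is constant w.r.t. the student)."""
    logits = student_hidden @ proj                      # (T, n_emotions)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(teacher_probs * log_probs).sum(axis=1).mean())

rng = np.random.default_rng(3)
T, d, n_emo = 50, 16, 4
student = rng.normal(size=(T, d))                 # acoustic-token hidden states
proj = rng.normal(size=(d, n_emo))                # trainable probe
teacher = rng.dirichlet(np.ones(n_emo), size=T)   # frozen SER posteriors
loss = emotion_distill_loss(student, teacher, proj)
print(loss)
```

Applying this on the acoustic branch, as the paper argues, keeps the gradient signal on the tokens that actually carry paralinguistic information rather than on semantic content tokens.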
The experiments are well-structured, adhering to the VoicePrivacy 2024 protocol, which allows for direct comparison with prior works. The results show a substantial improvement in emotion preservation (UAR) and privacy (EER) while maintaining competitive intelligibility (WER). The ablation studies provide clear evidence of the contributions of each component of the proposed method, reinforcing the claims made in the paper. The dataset used for training and evaluation is appropriate, although the reliance on acted speech corpora may limit generalizability.
The paper provides sufficient details regarding the implementation, including the training setup, data preprocessing, and evaluation metrics. However, the absence of a public code repository limits reproducibility. The authors mention that the demo is available, which is a positive aspect, but a comprehensive project URL would enhance reproducibility further.
The paper acknowledges several limitations, including the reliance on a single SER evaluator, the lack of subjective listening tests, and the evaluation being restricted to acted speech corpora. These factors could affect the generalizability and real-world applicability of the findings. Additionally, the gap in performance compared to offline methods suggests that further improvements are needed for practical deployment.
The proposed method has significant implications for privacy-preserving applications in various domains, including teleconferencing, call centers, and online mental health counseling. By effectively anonymizing speaker identity while preserving emotional content, this research addresses a critical need for maintaining communication effectiveness in sensitive contexts. The approach could pave the way for more sophisticated anonymization techniques that balance privacy and emotional expressiveness.
Streaming TTS that receives text incrementally is essential for interactive systems, yet this setting faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, adapting a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show our method outperforms the CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis, especially, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by 16.1% and 1.5% relatively, offering a robust solution for streaming TTS with incremental text.
Primary: Southeast University
All Institutions: Southeast University, Nanyang Technological University, Tianjin University
This paper presents a boundary-aware post-training strategy for streaming LLM-based text-to-speech with streaming text input. The proposed methodology effectively addresses the challenges of prosody and long-form stability in TTS systems, making a meaningful contribution to the field of audio machine learning.
The proposed methodology introduces a novel prosodic-boundary-aware adaptation strategy that leverages weakly time-aligned data to enhance streaming TTS systems. The bifurcated sequence input with a prosodic-boundary marker allows for improved prosody while maintaining contextual integrity. The sliding-window prompt mechanism effectively manages the context length, preventing unbounded growth and ensuring seamless audio generation. The approach avoids complex architectural modifications, which is a significant advantage. However, the reliance on weakly aligned data raises questions about the generalizability of the method across different datasets and languages.
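A minimal sketch of the bounded sliding-window prompt idea, assuming simple per-stream token budgets rather than the paper's exact windowing rules (class and limits below are hypothetical):

```python
from collections import deque

class SlidingWindowPrompt:
    """Bounded context for streaming TTS: keep only the most recent
    text and speech tokens as the prompt for the next chunk, so the
    prompt never grows with utterance length."""
    def __init__(self, max_text=64, max_speech=256):
        self.text = deque(maxlen=max_text)      # old entries evicted
        self.speech = deque(maxlen=max_speech)  # automatically

    def push_chunk(self, text_tokens, speech_tokens):
        self.text.extend(text_tokens)
        self.speech.extend(speech_tokens)

    def prompt(self):
        """Context carried into the next synthesis step."""
        return list(self.text), list(self.speech)

win = SlidingWindowPrompt(max_text=5, max_speech=8)
for i in range(20):                      # 20 incoming text chunks
    win.push_chunk([f"t{i}"], [f"s{i}a", f"s{i}b"])
txt, sp = win.prompt()
print(txt, len(sp))
```

Carrying the most recent speech tokens (not just text) in the prompt is what makes concatenation across windows seamless: the model continues from the acoustic state it just produced.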
The experiments are well-structured, utilizing both objective and subjective metrics to evaluate performance. The use of the Seed-TTS-Eval benchmark for both standard and long-form evaluations provides a comprehensive assessment of the proposed method's effectiveness. The significant improvements in WER, speaker similarity, and emotional consistency demonstrate the robustness of the approach. However, the paper could benefit from a more extensive comparison with additional state-of-the-art methods to further validate its superiority.
The paper provides sufficient details on the experimental setup, including dataset descriptions, evaluation metrics, and baseline comparisons. However, the lack of a publicly available code repository limits reproducibility. Future work should consider sharing the implementation to facilitate validation by the research community.
One limitation is the dependency on weakly time-aligned data, which may not be available for all languages or datasets. Additionally, while the results are promising, the method's performance in highly variable or noisy environments has not been tested. The paper also does not address the potential computational costs associated with the proposed adaptations.
The advancements in streaming TTS systems have significant implications for interactive applications such as virtual assistants, real-time translation, and accessibility tools. The ability to generate natural-sounding speech with minimal latency can enhance user experience and broaden the applicability of TTS technologies in various domains.
Large encoder-decoder models such as Whisper often exhibit hallucinations, repetition loops, and content omissions in long-form speech recognition. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3 percentage points on CORAAL and delivers 48% higher token-generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
Primary: Sungkyunkwan University
All Institutions: Sungkyunkwan University
The main contribution of this paper is the introduction of Whisper-CD, a contrastive decoding framework that significantly improves long-form speech recognition accuracy without the need for retraining. This work is a meaningful advancement in the field, addressing critical issues in existing models and offering a practical solution that can be readily adopted in deployed systems.
The proposed Whisper-CD framework introduces a novel approach to long-form speech recognition by employing a training-free contrastive decoding method. This method contrasts clean-audio logits with negative logits derived from various perturbations, which is innovative in the context of improving the robustness of speech recognition systems. The use of the log-sum-exp operator to aggregate negative samples is a thoughtful choice that enhances the decoding process. However, the paper could benefit from a more detailed explanation of the perturbations chosen and their specific impact on the model's performance.
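The multi-negative objective can be sketched as follows: each token's clean logit is penalized by a log-sum-exp aggregate of the logits the model produces on the perturbed inputs. The weighting constant `lam` and the exact per-token form are illustrative assumptions, not taken from the paper:

```python
import math

def logsumexp(xs):
    # Numerically stable log-sum-exp.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def contrastive_scores(clean_logits, negative_logit_sets, lam=0.5):
    """Per-token contrastive score: clean logit minus a log-sum-exp
    aggregate of the logits from the perturbed (negative) inputs.

    clean_logits:        list[float], one logit per vocabulary token
    negative_logit_sets: list of such lists, one per perturbation
    """
    scores = []
    for tok in range(len(clean_logits)):
        neg = logsumexp([negs[tok] for negs in negative_logit_sets])
        scores.append(clean_logits[tok] - lam * neg)
    return scores

clean = [2.0, 0.5, -1.0]
negatives = [[1.9, -2.0, -1.0],   # e.g. Gaussian-noise input
             [2.1, -2.5, -1.2],   # e.g. silence input
             [1.8, -2.2, -0.9]]   # e.g. temporally shifted input
scores = contrastive_scores(clean, negatives)
# Token 0 is strongly supported even by the perturbed inputs, so its
# margin shrinks relative to token 1, which only the clean audio supports.
best = max(range(len(scores)), key=scores.__getitem__)
```

The intuition is that hallucinated tokens tend to be likely even without the clean audio (they come from the language-model prior), so subtracting the negatives suppresses them.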
The paper presents a comprehensive evaluation across five English long-form benchmarks, demonstrating a significant reduction in word error rate (WER) and improved token generation throughput. The results are compelling, particularly the 24.3 percentage point reduction in WER on the CORAAL dataset and the 48% increase in throughput compared to traditional beam search methods. However, additional details on the datasets used, including their characteristics and the specific metrics employed, would strengthen the experimental section.
The paper lacks sufficient implementation details, such as hyperparameters, specific configurations of the Whisper model, and the exact nature of the perturbations applied. Without these details, it may be challenging for other researchers to reproduce the results. Including a link to a code repository or supplementary material would greatly enhance reproducibility.
One limitation of the Whisper-CD approach is that it operates purely at inference time, which, while advantageous for deployment, may limit its adaptability to different audio conditions or languages without retraining. Additionally, the reliance on specific perturbations may not generalize across all types of audio inputs, potentially affecting performance in diverse real-world scenarios.
The proposed method has significant implications for the field of speech recognition, particularly in applications requiring high accuracy over long-form audio, such as transcription services, media content creation, and accessibility technologies. By improving the reliability of long-form speech recognition systems, Whisper-CD can enhance user experience and broaden the adoption of such technologies.
Accent variability remains a major source of errors in automatic speech recognition, yet most adaptation methods rely on parameter fine-tuning without understanding where accent information is encoded. We treat accent variation as an interpretable subspace in hidden representations and investigate whether it can be identified and controlled directly in activation space. We extract layer-wise encoder activations and estimate mean-shift directions capturing accent-induced representation shifts. By injecting these directions into individual layers and measuring how they align accented and standard embeddings, we derive a layer-wise accent sensitivity profile, revealing that accent information concentrates in a narrow band of middle encoder layers. Leveraging this structure, we further introduce parameter-free accent steering that modifies representations during inference without updating model weights. Experiments across eight accents show consistent word error rate reductions.
Primary: The University of Melbourne
All Institutions: Wuhan University, The University of Melbourne
The main contribution of this paper is the introduction of a novel method for accent adaptation in speech recognition models that operates in activation space, providing a deeper understanding of accent representation in neural networks. The approach is innovative and has the potential to significantly improve the performance of speech recognition systems across diverse accents, marking a meaningful advancement in the field.
The methodology presented in this paper is innovative as it shifts the focus from traditional parameter fine-tuning to a more interpretable approach that directly manipulates activation space for accent adaptation. The authors successfully extract layer-wise encoder activations and compute mean-shift directions to capture accent-induced shifts, which is a novel contribution to the understanding of how accents are encoded in neural networks. The introduction of parameter-free accent steering is particularly noteworthy, as it allows for real-time adjustments during inference without the need for retraining, which could have significant practical implications.
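The mean-shift estimation and parameter-free steering can be sketched in a few lines; the scaling factor `alpha` and the toy 3-dimensional activations are illustrative assumptions:

```python
def mean(vectors):
    # Elementwise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def accent_shift_direction(accented_acts, standard_acts):
    """Mean-shift direction from accented toward standard activations,
    estimated at one layer from collections of hidden states."""
    mu_acc, mu_std = mean(accented_acts), mean(standard_acts)
    return [s - a for a, s in zip(mu_acc, mu_std)]

def steer(hidden, direction, alpha=1.0):
    """Parameter-free steering: add the scaled direction at inference,
    leaving all model weights untouched."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations: accented states sit offset from standard ones
# along the first two dimensions only.
accented = [[1.0, 0.0, 2.0], [1.2, 0.2, 2.2]]
standard = [[0.0, 1.0, 2.0], [0.2, 1.2, 2.2]]
d = accent_shift_direction(accented, standard)
steered = steer([1.1, 0.1, 2.1], d)
```

In the paper's setting this injection would be applied only at the middle encoder layers where the sensitivity profile concentrates.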
The experiments conducted across eight different accents provide a robust evaluation of the proposed method. The consistent reductions in word error rates across these accents demonstrate the effectiveness of the approach. However, the paper could benefit from a more detailed description of the datasets used, including their size, diversity, and how they were selected. Additionally, comparisons with existing state-of-the-art methods would strengthen the validation of the proposed technique.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the methodology is described, specific hyperparameters, training procedures, and the architecture of the models used in experiments are not adequately detailed. Providing a link to a code repository or supplementary materials would enhance reproducibility.
One limitation of the study is the potential overfitting to the specific accents tested, as the results may not generalize to other accents or dialects not included in the experiments. Additionally, the approach relies on the assumption that accent information is concentrated in specific layers, which may not hold true for all architectures or tasks. The paper also does not address the computational efficiency of the proposed method during inference.
The findings of this research could have significant implications for the development of more inclusive and accurate speech recognition systems, particularly in multilingual and multicultural contexts. By improving accent adaptation, the proposed method could enhance user experience and accessibility in various applications, from virtual assistants to automated transcription services.
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.
Primary: University of Southern California
All Institutions: University of Southern California
This paper presents a pioneering approach to simultaneously capture real-time MRI, EEG, and surface EMG data during speech production, offering valuable insights into the neurophysiological processes underlying speech. The innovative artifact suppression techniques and the potential applications in BCI and speech science highlight its significance in advancing the field.
The methodology presented in this paper is innovative, combining real-time MRI, EEG, and surface EMG to capture the complex dynamics of speech production. The authors have developed a multi-stage denoising pipeline to address significant technical challenges, including electromagnetic interference and myogenic artifacts. The use of canonical correlation analysis (CCA) for artifact removal is particularly noteworthy, as it allows for the effective suppression of non-neural signals while preserving the underlying neural activity. However, the methodology could benefit from further validation across a larger cohort to establish its robustness and generalizability.
The experimental design is well-structured, focusing on a single subject to explore the feasibility of simultaneous data acquisition. The tasks are clearly defined, and the results demonstrate the effectiveness of the artifact removal techniques. However, the reliance on a single participant limits the generalizability of findings. The authors provide thorough comparisons of EEG signals before and after denoising, showcasing significant improvements in signal quality, which is a strong point of the experimental evaluation.
The paper provides detailed descriptions of the experimental setup, data acquisition methods, and artifact correction techniques, which are essential for reproducibility. However, the lack of a publicly available dataset or code repository hinders full reproducibility of the results. Future work should include sharing data and methodologies to allow other researchers to validate and build upon these findings.
The primary limitations include the small sample size (single-subject study), which restricts the ability to generalize findings. Additionally, the use of passive electrodes may introduce higher noise levels compared to active electrodes, potentially affecting data quality. The EEG cap's design may not be optimal for capturing speech-specific brain activity, and residual artifacts from the EMG setup could still influence results. Lastly, the potential impact of scanner noise and visual stimuli on neural activity remains a concern.
This research has significant implications for both speech neuroscience and brain-computer interface (BCI) technologies. By providing a comprehensive view of the neural, muscular, and articulatory components of speech production, the findings could lead to advancements in silent speech interfaces and improved understanding of speech disorders. The methodology could pave the way for future studies exploring the intricacies of speech planning and execution, potentially transforming approaches to speech rehabilitation and communication technologies.
Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. We use TinyVox to train BabAR, a cross-linguistic phoneme recognition system for child speech. We find that pretraining the system on multilingual child-centered daylong recordings substantially outperforms alternatives, and that providing 20 seconds of surrounding audio context during fine-tuning further improves performance. Error analyses show that substitutions predominantly fall within the same broad phonetic categories, suggesting suitability for coarse-grained developmental analyses. We validate BabAR by showing that its automatic measures of speech maturity align with developmental estimates from the literature.
Primary: Harvard University
All Institutions: Harvard University, Massachusetts Institute of Technology
The paper presents BabAR, a pioneering phoneme recognition system for child speech, demonstrating significant advancements in automatic phonetic analysis through innovative methodology and extensive experimental validation.
The paper introduces a novel phoneme recognition system, BabAR, tailored for child speech, leveraging a large-scale dataset, TinyVox, which encompasses diverse languages and extensive child vocalizations. The methodology includes pretraining on multilingual data and context-aware fine-tuning, which are innovative approaches in the domain of child speech recognition. The use of Connectionist Temporal Classification (CTC) for sequence-to-sequence tasks is appropriate given the challenges of variable-length outputs in phoneme recognition. The systematic evaluation of different self-supervised models and the exploration of context duration for improving recognition accuracy are well-structured and contribute significantly to the methodology.
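The context-aware fine-tuning detail (20 seconds of surrounding audio) can be sketched as a window-expansion helper; whether the paper applies 20 s per side or in total, and how it treats recording boundaries, are assumptions made for this example:

```python
def with_context(seg_start, seg_end, file_dur, context=20.0):
    """Expand a child-vocalization segment by up to `context` seconds
    of surrounding audio on each side, clamped to the recording bounds.

    Times are in seconds; returns the (start, end) of the padded window.
    """
    start = max(0.0, seg_start - context)
    end = min(file_dur, seg_end + context)
    return start, end

# A 2 s vocalization at 30-32 s inside a 45 s recording: the right-hand
# context is clipped at the end of the file.
start, end = with_context(30.0, 32.0, 45.0)
```

The padded window would then be fed to the recognizer so the model sees the acoustic surroundings of the target vocalization during fine-tuning.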
The experiments are robust, comparing BabAR against state-of-the-art phoneme recognition systems and demonstrating significant performance improvements. The paper provides detailed error analysis, illustrating that BabAR's substitutions tend to remain within phonetic categories, which is crucial for developmental analyses. The validation of BabAR's performance against a longitudinal dataset supports its practical applicability in developmental research. However, the paper could benefit from more extensive comparisons with existing systems and additional metrics beyond phoneme error rates to fully capture the model's effectiveness.
The authors provide sufficient implementation details, including model architecture, training procedures, and evaluation metrics, which enhance reproducibility. The availability of the dataset and code on GitHub is a significant step towards ensuring that other researchers can replicate the study and build upon it. However, the paper could improve by including more explicit instructions for setting up the environment and dependencies.
The study acknowledges challenges in phonetic transcription, particularly the subjective nature of human annotation and the presence of competing signals in naturalistic recordings. While BabAR shows promise, the reliance on coarse-grained measures for validation may not guarantee accuracy at the individual level, which is critical for clinical applications. Additionally, the dataset's diversity in terms of language and age could introduce variability that may affect generalization.
The development of BabAR and TinyVox has the potential to revolutionize the study of early speech development by enabling large-scale, automated phonetic analysis. This could facilitate early detection of speech and language delays, enhance cross-linguistic studies, and improve educational tools for language learning. The integration of advanced machine learning techniques with developmental science opens up new avenues for research and practical applications in child language acquisition.
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction (~13 s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline, STT → LLM → TTS, where each component streams its output to the next; and (3) the key to "realtime" is not any single fast model but rather streaming and pipelining across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947 ms (best case 729 ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on an NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
Primary: Salesforce AI Research
All Institutions: Salesforce AI Research
The paper provides a comprehensive tutorial for building enterprise-grade realtime voice agents from scratch, emphasizing the importance of streaming and pipelining in achieving low latency. The technical contributions and methodology are significant, offering valuable insights and practical tools for researchers and practitioners in the field of audio machine learning.
The paper presents a systematic approach to building enterprise-grade realtime voice agents by dissecting the components of speech-to-text (STT), language model (LLM), and text-to-speech (TTS) into a cascaded streaming pipeline. The authors emphasize the importance of streaming and pipelining rather than relying on a single fast model, which is a critical insight for achieving low latency in voice interactions. The tutorial format is effective, providing a step-by-step guide that includes empirical evaluations of various models, thus making the methodology accessible and practical for developers.
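The pipelining insight, that low latency comes from streaming across stages rather than from any single fast model, can be sketched with plain generators standing in for the STT, LLM, and TTS services (the real components are network APIs such as Deepgram and ElevenLabs; these stand-ins are purely illustrative):

```python
def stt_stream(audio_chunks):
    """Stand-in for streaming STT: yields words as they are recognized."""
    for chunk in audio_chunks:
        yield f"word({chunk})"

def llm_stream(words):
    """Stand-in for a streaming LLM: emits reply tokens per input word."""
    for w in words:
        yield f"tok[{w}]"

def tts_stream(tokens):
    """Stand-in for streaming TTS: emits audio as soon as tokens arrive."""
    for t in tokens:
        yield f"audio<{t}>"

# Pipelining: each stage is a lazy generator, so the first audio frame
# is produced once the first word flows through all three stages, not
# after the full reply has been generated.
audio_out = tts_stream(llm_stream(stt_stream(["a1", "a2", "a3"])))
first_audio = next(audio_out)
```

Time-to-first-audio in this model is the sum of per-stage first-item latencies, which is why streaming every component matters more than speeding up any one of them.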
The experiments conducted are robust, comparing the performance of native speech-to-speech models against a cascaded pipeline approach. The authors provide detailed latency measurements for each component, demonstrating the effectiveness of their proposed architecture in achieving sub-1-second time-to-first-audio (TTFA). The empirical results are well-documented, showcasing the advantages of their approach in real-world scenarios, which adds credibility to their findings.
The paper includes a comprehensive codebase released as open-source, which is a significant advantage for reproducibility. The detailed tutorial format, along with the release of tested code for each component, allows other researchers and practitioners to replicate the results and build upon the work. However, the paper could benefit from clearer documentation on the specific environments and dependencies required to run the code effectively.
One limitation noted is the reliance on cloud APIs for some components, which may introduce variability in performance due to network latency. Additionally, the findings are based on specific models and configurations, which may not generalize across all potential implementations. The authors also acknowledge that native speech-to-speech models are not yet viable for real-time applications, which highlights the current constraints in the field.
This work has significant implications for the development of voice-based AI agents in enterprise settings, particularly in applications such as customer service, healthcare, and task management. By providing a clear framework and practical guidance, the paper can facilitate the adoption of real-time voice agents, potentially transforming user interactions across various industries.
While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the imperceptible waveform variations used in prior watermarking methods. To address this limitation, we propose Latent-Mark, the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the codec's invariant latent space. We achieve this by optimizing the audio waveform to induce a detectable directional shift in its encoded latent representation, while constraining perturbations to align with the natural audio manifold to ensure imperceptibility. To prevent overfitting to a single codec's quantization rules, we introduce Cross-Codec Optimization, jointly optimizing the waveform across multiple surrogate codecs to target shared latent invariants. Extensive evaluations demonstrate robust zero-shot transferability to unseen neural codecs, achieving state-of-the-art resilience against traditional DSP attacks while preserving perceptual imperceptibility. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
Primary: National Taiwan University
All Institutions: National Taiwan University, CyCraft AI Lab, MoonShine Animation Studio, RIKEN Center for Computational Science
The main contribution of this paper is the introduction of Latent-Mark, a novel zero-bit audio watermarking framework that effectively survives neural resynthesis by embedding watermarks within the latent space of audio codecs. This work represents a meaningful advancement in audio watermarking, addressing vulnerabilities posed by modern neural codecs and providing a foundation for future research in universal watermarking techniques.
The methodology presented in Latent-Mark is innovative, leveraging the concept of embedding watermarks within the invariant latent space of neural audio codecs. The approach of optimizing the audio waveform to induce a detectable shift while ensuring imperceptibility is a significant advancement over traditional methods. The introduction of Cross-Codec Optimization is particularly noteworthy, as it addresses the challenge of overfitting to specific codec characteristics, enhancing the generalizability of the watermarking technique across different audio codecs.
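The latent-space idea can be illustrated with a deliberately simplified toy: here the "codecs" are fixed linear maps, the zero-bit detector is a thresholded projection onto a secret key direction, and the step size, budget, and threshold are all invented for the example. The paper's actual codecs are learned neural models and its perceptual constraint is far richer than an L-infinity budget:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def embed_watermark(x, codecs, key, steps=100, lr=0.02, budget=0.5):
    """Toy cross-codec embedding: gradient-ascend the waveform so every
    surrogate codec's latent W @ x gains a positive projection onto a
    secret key direction, under an L-infinity perturbation budget."""
    x0, x = list(x), list(x)
    n = len(x)
    for _ in range(steps):
        # For linear codecs, d/dx of sum_k <W_k x, key> is sum_k W_k^T key.
        grad = [0.0] * n
        for W in codecs:
            for r, row in enumerate(W):
                for i in range(n):
                    grad[i] += row[i] * key[r]
        # Ascent step, then clip back to the imperceptibility budget.
        for i in range(n):
            cand = x[i] + lr * grad[i]
            x[i] = min(x0[i] + budget, max(x0[i] - budget, cand))
    return x

def detected(W, x, key, tau=0.3):
    """Zero-bit detector: project the codec latent onto the key."""
    return dot(matvec(W, x), key) > tau

# Two toy linear "codecs" (2x3) and a 2-dim secret key direction.
codecs = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
          [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]]
key = [1.0, -1.0]
clean = [0.1, 0.2, 0.3]
marked = embed_watermark(clean, codecs, key)
```

Jointly optimizing over both surrogate codecs targets their shared latent directions, which is the mechanism the Cross-Codec Optimization relies on for zero-shot transfer.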
The paper provides extensive evaluations demonstrating the robustness of the proposed method against both traditional DSP attacks and neural resynthesis. The experiments are well-structured, showcasing the performance of Latent-Mark in various scenarios, including zero-shot transferability to unseen codecs. The results indicate a strong resilience to attacks while maintaining perceptual quality, which is crucial for practical applications.
The paper lacks detailed implementation specifics, such as code availability or datasets used for training and evaluation, which could hinder reproducibility. Providing a GitHub repository or links to datasets would significantly enhance the reproducibility of the results.
One limitation of the study is the potential dependency on the specific codecs chosen for Cross-Codec Optimization. While the method shows promise, its performance on a broader range of codecs, especially those not included in the training phase, remains to be fully explored. Additionally, the paper does not address the computational complexity of the optimization process, which could impact real-time applications.
The implications of this research are significant, as it opens avenues for secure audio transmission and copyright protection in an era where neural codecs are becoming prevalent. The ability to maintain watermark integrity against advanced generative models could have far-reaching applications in media, entertainment, and digital rights management.
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.
Primary: University of Melbourne
All Institutions: University of Melbourne, Republic of Korea, School of Electrical Engineering
The paper presents the first ESDD challenge, providing a foundational framework for advancing the detection of environmental sound deepfakes. Its comprehensive methodology, extensive experimental results, and insights into future research directions mark a significant contribution to the field of audio deepfake detection.
The paper introduces a structured approach to environmental sound deepfake detection (ESDD) through the formulation of a challenge that includes a well-defined dataset (EnvSDD) and evaluation protocols. The methodology is robust, focusing on two distinct tracks that assess generalization across unseen generators and black-box scenarios, which are critical for real-world applications. The use of diverse audio generation models and the emphasis on cross-generator generalization are notable strengths. However, the paper could benefit from a more detailed explanation of the architectural choices made by the top-performing systems.
The experimental evaluation is comprehensive, with a large number of submissions (1,748) from 97 teams, indicating significant interest and engagement in the challenge. The results are systematically presented, showcasing the performance of baseline systems and top submissions across different tracks. The use of the Equal Error Rate (EER) as a metric is appropriate for the task, and the analysis of system design trends provides valuable insights into effective strategies for ESDD.
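The Equal Error Rate used to rank submissions can be computed with a simple threshold sweep; this helper is a generic sketch, not the challenge's official scoring code:

```python
def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate (real
    audio scored as fake) equals the false-rejection rate (fake audio
    scored as real). labels: 1 = fake, 0 = real; higher score = more fake.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, eer = float("inf"), 1.0
    # Sweep a decision threshold over every distinct score value.
    for thr in sorted(set(scores)):
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= thr) / n_neg
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < thr) / n_pos
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# A perfectly separable toy example yields an EER of zero.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0]
```

Real evaluations interpolate between thresholds on the DET curve, but the sweep above captures the metric's definition.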
While the paper mentions the availability of the EnvSDD dataset and the challenge results, it lacks detailed implementation specifics that would facilitate reproducibility. The inclusion of code repositories or links to the actual implementations of the top-performing systems would enhance reproducibility and allow other researchers to build upon this work.
One limitation is the potential overfitting of models to specific generators, as indicated by performance degradation on unseen generators. Additionally, the challenge does not address the potential for adversarial attacks on detection systems, which could be a significant concern in practical applications. The reliance on a specific evaluation metric (EER) may also limit the understanding of model performance across different contexts.
The implications of this work are significant, as it addresses a growing concern in the realm of audio deepfakes, which can have serious consequences for public safety and misinformation. The establishment of a benchmark for ESDD could catalyze further research and development in this area, leading to more robust detection systems that can be applied in various real-world scenarios, including security and media verification.
The paper presents the first comprehensive challenge for environmental sound deepfake detection, establishing a significant benchmark in the field. The methodology and results contribute to advancing the understanding of audio deepfake detection, highlighting both the challenges and opportunities for future research.
The paper introduces a novel challenge for environmental sound deepfake detection (ESDD), which is a significant gap in the current literature. The methodology includes the creation of a large-scale dataset (EnvSDD) and a structured challenge with two tracks to evaluate the robustness of detection systems. The task formulation is clearly defined, and the evaluation metrics are appropriate for the objectives. The analysis of architectural choices and training strategies among top-performing systems provides valuable insights into effective approaches for ESDD.
The experimental evaluation is robust, featuring a large number of submissions (1,748) from 97 teams, indicating significant interest and engagement from the research community. The results are well-documented, with clear comparisons between baseline systems and participant submissions. The use of Equal Error Rate (EER) as a metric is suitable for the binary classification task, and the challenge results highlight the varying performance across different generators, which is critical for understanding the challenges in the field.
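To make the EER metric concrete, the sketch below computes it from hypothetical bona fide and spoof scores as the operating point where the false-acceptance and false-rejection rates cross; the score values are illustrative only.

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Equal Error Rate: threshold at which the false-acceptance rate
    (spoof accepted) equals the false-rejection rate (bona fide rejected)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest crossing point
    return (far[idx] + frr[idx]) / 2

# Hypothetical detector scores (higher = more likely bona fide)
bona = np.array([0.9, 0.8, 0.7, 0.6, 0.4])
spoof = np.array([0.5, 0.3, 0.2, 0.1, 0.05])
print(compute_eer(bona, spoof))  # 0.2: one spoof and one bona fide misclassified
```

With perfectly separated score distributions the crossing point drops to zero, which is why EER summarizes a binary detector in a single threshold-free number.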
The paper provides sufficient details regarding the dataset construction and evaluation protocols, which are essential for reproducibility. However, the lack of direct access to code or detailed implementation specifics for the top-performing models may hinder full reproducibility. The challenge's website may provide additional resources, but direct links to code repositories would enhance reproducibility.
One limitation noted is the focus on clip-level classification, which may not adequately address the complexities of real-world audio scenarios where multiple sound events occur simultaneously. Additionally, the challenge primarily addresses detection without exploring the implications of false positives and negatives in practical applications, which could be a significant concern in real-world deployments.
The implications of this research are substantial, particularly in contexts where environmental sounds can be manipulated to create misinformation or panic (e.g., fake alarms or gunshots). The findings could inform the development of more robust detection systems applicable in security, media verification, and public safety. The challenge also sets a precedent for future research in audio deepfake detection, encouraging cross-domain approaches and the exploration of multimodal detection strategies.
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
Primary: Shanghai Jiao Tong University
All Institutions: National Taiwan University, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of TW-Sound580K, a specialized audio-text dataset, and the innovative methodologies for its curation and model adaptation, which significantly enhance the performance of audio-language models in localized contexts. The comprehensive approach to dataset construction and inference optimization represents a meaningful advancement in the field of machine learning for audio processing.
The paper introduces a robust methodology for constructing a large-scale audio-text dataset, TW-Sound580K, specifically targeting the unique acoustic characteristics of Taiwanese dialects. The Verify-Generate-Critique (VGC) protocol is a notable innovation, effectively addressing the challenges of data curation in a linguistically diverse context. The integration of Dual-ASR validation to filter and enhance the dataset quality is commendable, as it mitigates the risks of hallucinations in audio transcription. The dynamic Dual-ASR Arbitration mechanism further strengthens the inference process by selecting the most accurate transcription based on acoustic-conditioned perplexity, showcasing a thoughtful approach to model adaptation.
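The perplexity-based selection step can be illustrated in miniature. The sketch below picks between two candidate transcriptions using per-token log-probabilities; the candidates and their scores are hypothetical, and the paper's arbitration additionally conditions the perplexity on the audio.

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

def arbitrate(candidates):
    """Keep the transcription with the lowest perplexity, a simplified
    stand-in for the Dual-ASR Arbitration step described above."""
    return min(candidates, key=lambda c: perplexity(c["logprobs"]))

# Hypothetical outputs from two ASR systems for the same clip
cands = [
    {"text": "hypothesis A", "logprobs": [-0.2, -0.4, -0.3]},
    {"text": "hypothesis B", "logprobs": [-0.9, -1.1, -0.7]},
]
print(arbitrate(cands)["text"])  # hypothesis A (lower perplexity)
```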
The experimental validation of the Tai-LALM model on the TAU Benchmark demonstrates a significant performance improvement over the baseline, achieving 49.1% accuracy. This empirical evidence supports the effectiveness of the proposed dataset and methodology. The paper includes a comprehensive ablation study that isolates the contributions of various components, reinforcing the robustness of the findings. However, the reliance on a single benchmark may limit the generalizability of the results.
The authors provide a clear outline of their methodology, including the dataset construction process and the training setup for Tai-LALM. However, the lack of direct access to the raw audio data due to copyright constraints poses challenges for full reproducibility. The mention of providing source URLs and metadata upon de-anonymization is a positive step towards enabling future research.
The paper acknowledges several limitations, including the empirical nature of the VGC curation threshold, which may require recalibration for different regions. Additionally, the latency and VRAM overhead introduced by the Dual-ASR arbitration could hinder deployment in resource-constrained environments. The evaluation primarily focuses on the TAU Benchmark, which may not capture the full spectrum of performance across diverse acoustic scenarios.
This work has significant implications for the development of localized audio-language models, particularly in under-resourced linguistic regions. By addressing the localization gap, the proposed dataset and methodologies can enhance the performance of LALMs in understanding regional dialects and acoustic features. The framework established in this paper could serve as a model for similar efforts in other culturally rich but underrepresented areas.
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy but remain highly sensitive to unseen real-world data with large distribution shifts, including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential for improving model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of ASR-TRA, a novel test-time reinforcement adaptation framework that enhances the robustness of automatic speech recognition systems through causal interventions and semantic reward modeling. This work represents a significant step forward in addressing the challenges of deploying ASR systems in real-world conditions, providing a practical solution that balances accuracy and efficiency.
The proposed ASR-TRA framework introduces a novel approach to test-time adaptation (TTA) in automatic speech recognition (ASR) by leveraging reinforcement learning (RL) and causal interventions. The methodology is well-structured, utilizing a learnable decoder prompt and temperature-controlled stochastic decoding to generate diverse transcription candidates. The integration of a reward model based on audio-text semantic alignment is a significant innovation that addresses the limitations of existing TTA methods, which often rely on pseudo-labeling or entropy minimization. The use of a Structural Causal Model (SCM) to formalize the adaptation process adds rigor to the approach, although the paper could benefit from a more detailed explanation of the causal relationships involved.
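The candidate-generation step described here can be sketched in miniature: temperature-scaled sampling produces diverse outputs, and a reward function picks the best one. The reward below is a hypothetical stand-in for the paper's CLAP-based audio-text alignment score.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature):
    """Temperature-scaled softmax sampling: higher temperature flattens
    the distribution, yielding more diverse candidates."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def best_of_n(logits, reward_fn, n=8, temperature=1.5):
    """Sample n candidates and keep the highest-reward one, a simplified
    stand-in for scoring diverse transcriptions with a reward model."""
    candidates = [sample_with_temperature(logits, temperature) for _ in range(n)]
    return max(candidates, key=reward_fn)
```

In the actual framework the reward signal is also fed back to update the prompt and model parameters; this sketch only shows the selection side of the loop.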
The experiments conducted on the LibriSpeech and L2 Arctic datasets demonstrate the effectiveness of ASR-TRA in improving ASR robustness against noise and accent variations. The results indicate a significant reduction in word error rates (WER) compared to existing TTA methods, showcasing the practical applicability of the proposed framework. The ablation studies provide valuable insights into the contributions of different components, confirming the importance of both prompt tuning and reward modeling. However, the paper could enhance its experimental evaluation by including more diverse datasets and real-world scenarios to further validate the robustness of the method.
The paper provides sufficient details regarding the implementation of ASR-TRA, including the architecture, datasets, and evaluation metrics. The inclusion of hyperparameters and specific configurations aids in reproducibility. However, the lack of a comprehensive description of the training process and the absence of a public demo could hinder full reproducibility for other researchers.
One limitation of the proposed method is its reliance on the CLAP reward model, which may not generalize well across all types of audio inputs. Additionally, while the method shows improvements in accuracy and latency, the computational cost associated with generating multiple candidates and evaluating them could be a concern in resource-constrained environments. The paper also does not address potential scalability issues when deploying the model in real-time applications.
The ASR-TRA framework has the potential to significantly enhance the robustness of ASR systems in real-world applications, particularly in environments with high noise levels or diverse accents. This could lead to improved accessibility and user experience in various domains, including voice-activated assistants, transcription services, and communication aids for individuals with speech impairments. The focus on test-time adaptation without requiring ground-truth labels is particularly relevant for applications where labeled data is scarce or unavailable.
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study of Segment Anything Model Audio (SAM-Audio), Meta AI's recent foundation-scale speech enhancement model, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER) and Character Error Rate (CER) compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. We therefore conduct a detailed utterance-level analysis to understand this counterintuitive result. We find that the recognition degradation is a systematic issue affecting the majority of utterances, not just isolated outliers, and that the errors worsen as the Whisper model size increases. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition. This highlights the risk of blindly applying state-of-the-art denoising as a preprocessing step in zero-shot ASR pipelines.
Primary: University of Rajshahi
All Institutions: University of Rajshahi, Anan National College of Technology
The main contribution of this paper is the critical examination of the assumption that improving perceptual audio quality through denoising enhances ASR performance, revealing that such enhancements can actually degrade recognition accuracy in zero-shot ASR contexts. This comprehensive analysis challenges prevailing notions and underscores the need for ASR-aware approaches to speech preprocessing, thereby advancing the understanding of the interplay between audio quality and machine recognition.
The methodology is robust, employing a systematic empirical study to evaluate the impact of SAM-Audio on zero-shot ASR performance across two distinct datasets. The authors clearly outline their preprocessing pipeline, ASR models, and evaluation metrics, ensuring that the study is well-structured and reproducible. However, the reliance on a single variant of SAM-Audio due to computational constraints may limit the generalizability of the findings.
The experiments are comprehensive, covering multiple Whisper model variants and two linguistically diverse datasets. The use of WER and CER as primary metrics is appropriate for assessing ASR performance. The results consistently demonstrate that SAM-Audio preprocessing degrades ASR performance, which is a significant finding that challenges existing assumptions in the field.
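For readers less familiar with the primary metric, WER is the word-level Levenshtein (edit) distance between reference and hypothesis, normalized by reference length; CER is the same computation over characters. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum substitutions + insertions + deletions
    to turn hypothesis into reference, divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words -> 0.333...
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason degradation on already-noisy audio can look dramatic.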
The paper provides sufficient detail regarding the experimental setup, including datasets and evaluation protocols, which facilitates reproducibility. However, the lack of access to the SAM-Audio model variants used in the experiments may hinder full reproducibility for other researchers.
The study is limited by the use of only the SAM-Audio Small variant and the focus on zero-shot ASR, which may not capture the full potential of the enhancement model. Additionally, the analysis is based on two datasets, which may not encompass the full range of real-world acoustic conditions.
This research has significant implications for the field of ASR and speech enhancement, as it highlights the risks of applying denoising techniques without considering their impact on recognition accuracy. The findings encourage a reevaluation of preprocessing strategies in ASR systems, particularly in zero-shot settings.
Voice timbre attribute detection (vTAD) is the task of determining the relative intensity of timbre attributes between speech utterances. Voice timbre is a crucial yet inherently complex component of speech perception. While deep neural network (DNN) embeddings perform well in speaker modelling, they often act as black-box representations with limited physical interpretability and high computational cost. In this work, a compact acoustic parameter set is investigated for vTAD. The set captures important acoustic measures and their temporal dynamics, both of which prove crucial to the task. Despite its simplicity, the acoustic parameter set is competitive, outperforming conventional cepstral features and supervised DNN embeddings, and approaching state-of-the-art self-supervised models. Importantly, the studied set requires no trainable parameters, incurs negligible computation, and offers explicit interpretability for analysing the physical traits behind human timbre perception.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of a compact and interpretable acoustic parameter set for voice timbre attribute detection, which effectively competes with complex DNN-based approaches while offering significant advantages in interpretability and computational efficiency. The research addresses a critical gap in the field by providing a practical solution that balances performance with the need for understanding the underlying acoustic features relevant to human speech perception.
The paper proposes a novel approach to voice timbre attribute detection (vTAD) using a compact set of acoustic parameters that captures essential features without requiring training. This method contrasts with traditional deep neural networks (DNNs), which are often computationally intensive and lack interpretability. The methodology is well-structured, focusing on the extraction of 13 acoustic features and their temporal dynamics, leading to a 26-dimensional representation. The use of a simple Diff-Net for classification is appropriate, although the paper could benefit from more detailed descriptions of the feature extraction process and the rationale behind the choice of acoustic parameters.
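The 13-parameters-plus-dynamics construction can be sketched as follows. The specific parameter layout and the choice of mean absolute difference as the dynamics summary are illustrative assumptions; the paper's exact feature list and dynamics computation may differ.

```python
import numpy as np

def utterance_descriptor(frames):
    """frames: (T, 13) per-frame acoustic parameters (hypothetical layout,
    e.g. F0, formants, spectral measures). Returns a 26-dim vector:
    per-feature temporal mean plus mean absolute frame-to-frame delta."""
    means = frames.mean(axis=0)                            # 13 static summaries
    deltas = np.abs(np.diff(frames, axis=0)).mean(axis=0)  # 13 temporal dynamics
    return np.concatenate([means, deltas])

x = np.random.default_rng(1).normal(size=(100, 13))
print(utterance_descriptor(x).shape)  # (26,)
```

The appeal of such a descriptor is exactly what the paper argues: every dimension has a physical reading, and extraction costs a handful of vectorized operations rather than a forward pass through a network.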
The experiments are robust, utilizing a well-defined dataset (VCTK-RVA) with expert annotations, which enhances the reliability of the results. The performance metrics (Accuracy and EER) are clearly presented, showing that the proposed method competes well against established DNN-based models. However, the paper could improve by providing more comparative analysis with other state-of-the-art methods and discussing the implications of the results in greater detail.
The paper lacks sufficient implementation details that would facilitate reproducibility. While the methodology is described, specific parameters, configurations, and code availability are not mentioned, which could hinder other researchers from replicating the results.
One limitation is the reliance on a single dataset, which may affect the generalizability of the findings. Additionally, while the proposed method is interpretable, the paper does not fully explore the implications of this interpretability in practical applications. The absence of a demo or project URL also limits accessibility for further exploration of the work.
The study has significant implications for fields requiring voice analysis, such as forensics, healthcare, and human-computer interaction. The focus on interpretability and computational efficiency can lead to more accessible and user-friendly applications in speech technology. The findings could influence future research directions in audio processing and speech perception, particularly in developing systems that prioritize interpretability alongside performance.
Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven acoustic features remain important for modeling fine-grained structural properties of speech. Most existing acoustic front ends are based on time-frequency representations, which do not fully exploit higher-order spectral dependencies inherent in speech signals. We introduce a cyclostationarity-inspired acoustic feature extraction framework for SDD based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech by capturing spectral correlations between frequency components. In particular, we propose temporally structured SCD features that characterize the evolution of spectral and cyclic-frequency components over time. The effectiveness and complementarity of the proposed features are evaluated using multiple countermeasure architectures, including convolutional neural networks, SSL-based embedding systems, and hybrid fusion models. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5 demonstrate that SCD-based features provide complementary discriminative information to SSL embeddings and conventional acoustic representations. In particular, fusion of SSL and SCD embeddings reduces the equal error rate on ASVspoof 2019 LA from $8.28\%$ to $0.98\%$, and yields consistent improvements on the challenging ASVspoof 5 dataset. The results highlight cyclostationary signal analysis as a theoretically grounded and effective front end for speech deepfake detection.
Primary: Bursa Technical University
All Institutions: Bursa Technical University, TCG CREST, University of Eastern Finland
The main contribution of this paper is the introduction of a cyclostationarity-based feature extraction framework for speech deepfake detection, which significantly enhances the detection capabilities by capturing spectral correlations that are often overlooked by conventional methods. This work represents a meaningful advancement in the field of audio signal processing and machine learning, particularly in the context of combating the growing threat of synthetic audio content.
The paper introduces a novel cyclostationarity-inspired feature extraction framework for speech deepfake detection (SDD) that leverages spectral correlation density (SCD) to capture periodic statistical structures in speech. The methodology is well-grounded in signal processing theory, addressing the limitations of conventional time-frequency representations. The proposed two-dimensional SCD features are designed to incorporate temporal dynamics, which enhances their discriminative power. The use of multiple countermeasure architectures, including convolutional neural networks and self-supervised learning embeddings, demonstrates a comprehensive approach to evaluating the effectiveness of the proposed features.
The experiments are robust, utilizing three challenging datasets (ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5) to validate the proposed features. The results indicate significant improvements in equal error rates when combining SCD features with SSL embeddings, showcasing the complementarity of the approaches. The experimental setup is thorough, with clear metrics for performance evaluation (EER and minDCF), and the results are presented in a manner that highlights the advantages of the proposed methods over existing baselines.
The paper provides sufficient detail regarding the experimental setup, including datasets, feature extraction methods, and model architectures. However, the absence of a publicly available code repository limits the reproducibility of the results. The authors do provide a demo URL for synthesized speech, which is beneficial but does not fully compensate for the lack of code.
One limitation is the reliance on specific datasets, which may not capture the full diversity of speech deepfake scenarios. Additionally, while the results are promising, the paper does not address potential overfitting issues or the generalizability of the models to unseen spoofing techniques. The computational complexity of the SCD feature extraction process may also pose challenges for real-time applications.
The proposed methodology has significant implications for enhancing the security and trustworthiness of voice-driven technologies, particularly in applications like audio forensics and telecommunication security. By improving the detection of speech deepfakes, the research contributes to the broader field of audio signal processing and machine learning, addressing a critical need in the era of advanced synthetic media.
Generative audio applications call for fine-grained control over outputs, yet most existing methods require either retraining the model for specific controls or inference-time controls (e.g., guidance) that can be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high per-step cost due to decoder backpropagation, we introduce a guidance-based approach combining selective Training-Free Guidance (TFG) with Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step and requiring minimal training resources (7M parameters and approximately 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations thereof) while maintaining generation quality. Our method balances precision and audio fidelity at far lower computational cost than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
Primary: UC San Diego
All Institutions: UC San Diego
The main contribution of this paper is the introduction of a low-resource, inference-time control framework for latent audio diffusion models, which effectively balances control precision, audio fidelity, and runtime performance. The methodology and results presented are significant advancements in the field of controllable audio generation, showcasing the potential for efficient and high-quality audio synthesis.
The paper introduces a novel approach to controllable audio generation through the use of Latent-Control Heads (LatCHs) and selective Training-Free Guidance (TFG). By operating directly in latent space, the proposed method significantly reduces computational overhead associated with traditional end-to-end guidance methods. The methodology is well-structured, with clear explanations of how LatCHs function and the rationale behind selective TFG. The authors provide a solid theoretical foundation, linking their work to existing literature while clearly delineating their contributions.
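The per-step latent guidance idea, predicting a control value from the latent and nudging the latent toward a target without touching the decoder, can be sketched with a hypothetical linear head. The paper's LatCHs are small learned networks, and its guidance is applied selectively during diffusion sampling; this toy shows only the gradient step in latent space.

```python
import numpy as np

def guide_latent(z, head_w, target, steps=300, lr=0.01):
    """Gradient-guided latent update: a linear 'control head' head_w
    predicts a scalar control from latent z; gradient descent on the
    squared error nudges z until the prediction matches target."""
    for _ in range(steps):
        pred = head_w @ z
        grad = 2 * (pred - target) * head_w  # d/dz of (pred - target)**2
        z = z - lr * grad
    return z

rng = np.random.default_rng(0)
z = rng.normal(size=16)       # stand-in diffusion latent
w = rng.normal(size=16)       # hypothetical linear control head
z_new = guide_latent(z, w, target=1.0)
print(round(float(w @ z_new), 2))  # prediction pulled to 1.0
```

Because the head reads the latent directly, each guidance step costs one small matrix product instead of a decoder backpropagation, which is the cost saving the paper targets.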
The experiments are comprehensive, utilizing the Stable Audio Open (SAO) dataset and comparing the proposed methods against established baselines, including end-to-end guidance and readouts. The evaluation metrics are well-defined, including both qualitative assessments (mean opinion scores) and quantitative metrics (FDopenl3, KLpasst, and CLAP). The results demonstrate that LatCHs outperform traditional methods in terms of both audio quality and computational efficiency, which is a significant achievement in the field of audio generation.
The paper provides sufficient details regarding the experimental setup, including hyperparameters and training procedures for LatCHs. However, the lack of a publicly available code repository may hinder full reproducibility. The authors do mention the datasets used, which aids in replicating the experiments, but the absence of a project URL limits access to the implementation.
One limitation is the potential challenge in generalizing the method to more complex audio generation tasks beyond the evaluated controls (intensity, pitch, and beats). Additionally, the reliance on specific feature extractors may limit the applicability of the approach to other audio domains. The authors also note that controls with greater variability, such as pitch, pose challenges, indicating room for improvement in handling such cases.
The proposed framework has significant implications for the field of generative audio, particularly in applications requiring real-time audio manipulation and control. The ability to generate high-quality audio with low computational costs can benefit various industries, including music production, gaming, and virtual reality. Furthermore, the approach could pave the way for more accessible audio generation tools for creators without extensive computational resources.
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based audio-visual speech recognition framework built on Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art AVSR results. Under additive-noise conditions at 0 dB SNR, it achieves a 37% relative improvement over the baseline system.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, The Chinese University of Hong Kong, Wuhan University
The main contribution of this paper is the introduction of AVUR-LLM, a novel framework for audio-visual speech recognition that leverages sparse modality alignment and visual unit-guided refinement to achieve state-of-the-art performance in challenging acoustic conditions. This work significantly advances the field of AVSR by addressing key limitations of existing methods and demonstrating the potential for improved robustness and accuracy in speech recognition tasks.
The proposed methodology introduces several innovative components such as Sparse Modality Alignment (SMA), Adaptive Modulated Fusion (AMF), and Visual Unit-Guided Refinement (VUR). SMA allows for a more controlled interaction between audio and visual modalities by inserting alignment blocks into the audio encoder, which is a significant improvement over existing methods that typically rely on shallow fusion. The AMF component intelligently modulates visual feature injection based on acoustic reliability, enhancing the model's adaptability to varying input conditions. The VUR approach effectively transforms visual representations into discrete tokens for LLM rescoring, which is a novel strategy that leverages the strengths of both visual and language models. Overall, the methodology is well-structured and addresses key limitations in prior AVSR systems.
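The paper does not spell out the AMF equations here; the gating idea can be illustrated with a minimal numpy sketch, assuming (hypothetically) that a per-frame gate derived from the audio stream scales how much projected visual information is injected. All names and the linear gate/projection are illustrative placeholders, not the authors' actual parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_modulated_fusion(audio, visual, w_gate, w_proj):
    """Hypothetical sketch of AMF-style gating: project visual features
    into the audio space and inject them scaled by a per-frame gate
    computed from the audio stream (a proxy for acoustic reliability).

    audio:  (T, D)  audio features
    visual: (T, Dv) visual features, time-aligned with audio
    w_gate: (D,)    gate weights producing a scalar gate per frame
    w_proj: (Dv, D) visual-to-audio projection
    """
    gate = sigmoid(audio @ w_gate)[:, None]   # (T, 1), each entry in (0, 1)
    v_proj = visual @ w_proj                  # (T, D) projected visual stream
    return audio + gate * v_proj              # more visual input when gate is high

rng = np.random.default_rng(1)
T, D, Dv = 50, 64, 32
fused = adaptive_modulated_fusion(
    rng.standard_normal((T, D)), rng.standard_normal((T, Dv)),
    rng.standard_normal(D), rng.standard_normal((Dv, D)))
print(fused.shape)  # (50, 64)
```

In a trained system the gate would be learned so that degraded audio (e.g. low SNR) yields larger gate values, letting the visual stream compensate.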
The experiments conducted on the LRS3 dataset demonstrate the effectiveness of the proposed model, achieving state-of-the-art results in various noise conditions. The reported 37% relative improvement in Word Error Rate (WER) under 0 dB SNR conditions is particularly noteworthy, showcasing the robustness of the model in challenging scenarios. The ablation studies provide additional insights into the contributions of each component, reinforcing the validity of the proposed framework. However, the paper could benefit from a more detailed discussion on the statistical significance of the results and comparisons with a broader range of existing methods.
The paper provides a comprehensive overview of the experimental setup, including details on the dataset, model architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider making the implementation accessible to facilitate validation by the research community.
One limitation of the study is the reliance on a single dataset (LRS3) for evaluation, which may not fully capture the generalizability of the model across different domains or languages. Additionally, while the method shows improvements in noise robustness, the paper does not explore the performance in extremely adverse conditions or with diverse accents and speech patterns. The computational efficiency of the proposed model, particularly in real-time applications, is also not thoroughly addressed.
The advancements in AVSR presented in this paper have significant implications for various applications, including assistive technologies for the hearing impaired, video conferencing systems, and automated transcription services. By enhancing the robustness of speech recognition in noisy environments, this research contributes to making communication technologies more accessible and effective.
Training-free anomalous sound detection (ASD) based on pre-trained audio embedding models has recently garnered significant attention, as it enables the detection of anomalous sounds using only normal reference data while offering improved robustness under domain shifts. However, existing embedding-based approaches almost exclusively rely on temporal mean pooling, while alternative pooling strategies have so far only been explored for spectrogram-based representations. Consequently, the role of temporal pooling in training-free ASD with pre-trained embeddings remains insufficiently understood. In this paper, we present a systematic evaluation of temporal pooling strategies across multiple state-of-the-art audio embedding models. We propose relative deviation pooling (RDP), an adaptive pooling method that emphasizes informative temporal deviations, and introduce a hybrid pooling strategy that combines RDP with generalized mean pooling. Experiments on five benchmark datasets demonstrate that the proposed methods consistently outperform mean pooling and achieve state-of-the-art performance for training-free ASD, including results that surpass all previously reported trained systems and ensembles on the DCASE2025 ASD dataset.
Primary: Aalborg University
All Institutions: Aalborg University, Pioneer Centre for Artificial Intelligence
The paper presents a novel exploration of temporal pooling strategies in training-free anomalous sound detection, significantly advancing the understanding of this critical component in audio processing pipelines. The systematic evaluation and introduction of innovative pooling methods contribute valuable insights and methodologies that can influence future research and applications in the field.
The paper introduces relative deviation pooling (RDP) and a hybrid pooling strategy that combines RDP with generalized mean pooling (GEM). This approach emphasizes informative temporal deviations, addressing a significant gap in the current understanding of temporal pooling in training-free anomalous sound detection (ASD). The systematic evaluation of various pooling strategies across multiple state-of-the-art audio embedding models is a strong methodological contribution, as it not only highlights the importance of pooling mechanisms but also provides a framework for future research in this area.
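The exact RDP formula is not reproduced in this summary; a minimal numpy sketch of the idea follows, assuming (hypothetically) that RDP weights each frame by its deviation from the temporal mean embedding, and combining it with standard generalized mean (GeM) pooling. Function names and the convex-combination hybrid are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def gem_pool(frames, p=3.0, eps=1e-6):
    """Generalized mean (GeM) pooling over the time axis.

    frames: (T, D) array of non-negative frame-level embeddings.
    p=1 recovers mean pooling; larger p moves toward max pooling.
    """
    clipped = np.clip(frames, eps, None)
    return np.power(np.mean(np.power(clipped, p), axis=0), 1.0 / p)

def relative_deviation_pool(frames, eps=1e-6):
    """Hypothetical sketch of relative deviation pooling (RDP): weight
    each frame by how far it deviates from the temporal mean, so that
    informative (atypical) frames dominate the clip-level embedding."""
    mu = frames.mean(axis=0)                   # (D,) temporal mean
    dev = np.linalg.norm(frames - mu, axis=1)  # (T,) per-frame deviation
    w = dev / (dev.sum() + eps)                # normalize to weights
    return (w[:, None] * frames).sum(axis=0)   # deviation-weighted average

def hybrid_pool(frames, alpha=0.5, p=3.0):
    """Convex combination of RDP and GeM pooling, as a stand-in for the
    paper's hybrid strategy."""
    return alpha * relative_deviation_pool(frames) + (1 - alpha) * gem_pool(frames, p)

rng = np.random.default_rng(0)
emb = rng.random((200, 128))   # 200 frames of 128-dim embeddings
clip_vec = hybrid_pool(emb)
print(clip_vec.shape)          # (128,)
```

Whatever the exact weighting, the point of the comparison is that the clip-level vector fed to the training-free anomaly scorer is highly sensitive to this pooling choice, which mean pooling alone leaves unexplored.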
The experiments are conducted on five benchmark datasets, demonstrating that the proposed methods consistently outperform traditional mean pooling and achieve state-of-the-art performance for training-free ASD. The results are rigorously analyzed, showing significant improvements over existing methods, including previously reported trained systems. The paper includes comprehensive comparisons and ablation studies, validating the effectiveness of the proposed pooling strategies.
The paper provides detailed descriptions of the datasets, experimental setup, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or demo URLs limits the ability for others to directly replicate the findings. The authors mention the use of specific hyperparameters but do not provide a repository for the implementation, which could be a barrier for reproducibility.
One limitation is the reliance on pre-trained audio embedding models, which may not generalize well to all types of anomalous sounds. Additionally, while the proposed pooling strategies show significant improvements, the paper does not explore the potential of integrating these methods into supervised or semi-supervised frameworks, which could further enhance performance. The focus on training-free methods may also limit applicability in scenarios where labeled data is available.
The findings have significant implications for real-world applications in anomaly detection, particularly in industrial settings where rapid deployment and robustness to domain shifts are critical. The proposed methods could lead to more effective monitoring systems for machinery and environmental sounds, potentially reducing downtime and improving safety. The emphasis on training-free approaches also opens avenues for applications in resource-constrained environments.
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
Primary: Samsung R&D Institute UK (SRUK)
All Institutions: Samsung R&D Institute UK (SRUK), Mobile eXperience Business, Republic of Korea
The main contribution of this paper is the introduction of FlowW2N, a novel approach for whispered-to-normal speech conversion that achieves state-of-the-art performance by leveraging synthetic data and domain-invariant features. This work represents a meaningful advancement in the field of audio processing and speech synthesis, addressing critical challenges in speech intelligibility and quality.
The proposed FlowW2N method introduces a novel conditional flow matching approach that effectively addresses the challenges of whispered-to-normal speech conversion, particularly the temporal misalignment and lack of paired data. By leveraging synthetic data and domain-invariant ASR embeddings, the authors successfully sidestep traditional alignment issues, which is a significant advancement in the field. The architecture employs a Diffusion Transformer and utilizes a Gaussian prior for generation, which is innovative and well-justified. The methodology is clearly articulated, with a systematic exploration of different conditioning mechanisms and layer selection criteria that enhance the model's performance.
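The conditional flow matching objective underlying this family of models follows a standard recipe (a linear probability path with a constant-velocity regression target); a minimal numpy sketch is given below. The conditioning on ASR embeddings matches the paper's setup, but the toy linear `model` is a placeholder for the Diffusion Transformer, and all dimensions are illustrative.

```python
import numpy as np

def cfm_training_step(x0, x1, cond, velocity_model, rng):
    """One conditional flow matching step on a linear path:
    x_t = (1 - t) * x0 + t * x1, with target velocity v* = x1 - x0.
    Returns the squared-error flow matching loss.

    x0: noise sample; x1: target (normal-speech) features;
    cond: domain-invariant ASR embedding used for conditioning.
    """
    t = rng.random()                   # t ~ U(0, 1)
    x_t = (1.0 - t) * x0 + t * x1      # point on the straight path
    v_target = x1 - x0                 # constant velocity along the path
    v_pred = velocity_model(x_t, t, cond)
    return np.mean((v_pred - v_target) ** 2)

# Toy placeholder model: a fixed linear map over [x_t, t, cond].
rng = np.random.default_rng(2)
D, C = 80, 256
W = rng.standard_normal((D + 1 + C, D)) * 0.01
model = lambda x_t, t, cond: np.concatenate([x_t, [t], cond]) @ W

loss = cfm_training_step(
    rng.standard_normal(D), rng.standard_normal(D),
    rng.standard_normal(C), model, rng)
print(loss >= 0.0)  # True
```

Because the learned velocity field is integrated deterministically at inference, very few steps (the paper reports 10) can suffice, which is the source of the method's runtime advantage over many-step diffusion samplers.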
The experiments are comprehensive, utilizing two well-established datasets (CHAINS and wTIMIT) to evaluate the model's performance. The results demonstrate a significant reduction in Word Error Rate (WER) compared to prior methods, achieving state-of-the-art intelligibility. The paper includes ablation studies that provide insights into the contributions of various components of the model, reinforcing the robustness of the findings. The evaluation metrics are appropriate and well-defined, ensuring that the results are credible and reproducible.
While the paper provides a detailed description of the methodology and experimental setup, it lacks a publicly available code repository or demo URL, which hinders reproducibility. The authors mention using internal generative AI tools for language refinement, but there is no indication of whether the model or data will be made available for further research.
One limitation is the reliance on synthetic data for training, which may not fully capture the complexities of real-world whispered speech. Additionally, while the model shows impressive performance on the evaluated datasets, its generalizability to other languages or dialects is not addressed. The absence of a demo or code repository also limits the accessibility of the research for further validation by the community.
The implications of this research are significant, particularly for speech recognition and synthesis applications serving individuals with speech impairments or operating in noisy environments. The ability to convert whispered speech to normal speech could enhance communication for those who rely on whispering for various reasons, broadening the accessibility of speech technology.