Audio codecs power discrete music generative modelling, music streaming, and immersive media by shrinking PCM audio to bandwidth-friendly bitrates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram-domain models typically struggle with phase modeling, since the underlying STFT representation is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. As a result, adversarial discriminators must be introduced, at the expense of convergence speed and training stability, to compensate for this inadequate representation of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion, we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance on phase coherence and waveform fidelity. Compared to standard baselines that train for hundreds of thousands of steps, our model, which reduces the training budget by an order of magnitude, is markedly more compute-efficient while preserving high perceptual quality.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome
This paper presents EuleroDec, the first end-to-end complex-valued RVQ-VAE audio codec, which achieves state-of-the-art performance while significantly reducing training time and avoiding adversarial training methods. The methodology is innovative, addressing critical challenges in audio coding, and its implications for the field are substantial, particularly in enhancing audio quality and efficiency in various applications.
The paper introduces a novel end-to-end complex-valued RVQ-VAE audio codec, which effectively preserves magnitude-phase coupling throughout the analysis-quantization-synthesis pipeline. The methodology is robust, employing complex convolutions, attention mechanisms, and normalization techniques that respect the algebraic structure of the STFT. The use of Wirtinger calculus for optimization is a significant advancement, allowing the model to operate entirely within the complex domain without resorting to real-valued detours. This approach addresses the limitations of existing codecs that either ignore phase information or require adversarial training, thus enhancing both convergence speed and stability.
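For readers less familiar with complex-valued layers, a minimal sketch (illustrative only, not the authors' implementation, with placeholder shapes) shows how a complex convolution over the STFT can be built from two real-valued kernels so that magnitude and phase remain coupled:

```python
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    """Applies (W_r + iW_i) to (x_r + ix_i): two real kernels realize one complex layer."""
    def __init__(self, in_ch, out_ch, kernel_size, **kwargs):
        super().__init__()
        self.conv_r = nn.Conv1d(in_ch, out_ch, kernel_size, **kwargs)
        self.conv_i = nn.Conv1d(in_ch, out_ch, kernel_size, **kwargs)

    def forward(self, x):
        # x: complex tensor of shape (batch, channels, frames), e.g. an STFT slice
        xr, xi = x.real, x.imag
        real = self.conv_r(xr) - self.conv_i(xi)
        imag = self.conv_r(xi) + self.conv_i(xr)
        return torch.complex(real, imag)

# Example: a complex STFT with 513 frequency bins treated as channels.
stft = torch.randn(1, 513, 96, dtype=torch.cfloat)
layer = ComplexConv1d(513, 256, kernel_size=3, padding=1)
print(layer(stft).shape)  # torch.Size([1, 256, 96])
```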
The experiments are well-structured, benchmarking the proposed codec against established state-of-the-art models like APCodec, AudioDec, and EnCodec. The evaluation metrics are comprehensive, covering SI-SDR, PESQ, STOI, and GDD, which provide a holistic view of the codec's performance across different aspects of audio fidelity. The results demonstrate that EuleroDec achieves competitive performance with significantly fewer training steps, showcasing its efficiency and robustness in various scenarios, including out-of-domain testing.
The paper provides a detailed account of the training setup, including the dataset, optimization parameters, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available implementation or code repository limits the ability of others to replicate the results directly. Future work should consider releasing the model and training code to facilitate broader validation of the findings.
While the paper presents a significant advancement, it does not address potential limitations in terms of scalability to higher bitrates or real-time applications. Additionally, the reliance on complex-valued representations may introduce challenges in deployment on hardware that is optimized for real-valued computations. The performance in highly noisy environments or with diverse audio sources remains to be fully explored.
The proposed codec has the potential to significantly impact audio coding applications, particularly in scenarios requiring high fidelity at low bitrates, such as streaming and immersive media. Its efficiency and robustness could lead to advancements in music generation and speech synthesis, making it a valuable contribution to the field of audio processing.
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring, and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for any speaker's voice, while maintaining competitive performance on normal speech synthesis.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Laboratory of Artificial Intelligence, VUI Labs
The main contribution of this paper is the introduction of DeepASMR, a pioneering framework for zero-shot ASMR speech generation that leverages a two-stage architecture to synthesize high-quality ASMR speech from any speaker's voice without requiring whispered training data. This work significantly advances the state of the art in speech synthesis, particularly in generating nuanced, low-intensity speech styles essential for relaxation.
The methodology is innovative, introducing a two-stage framework that effectively separates content and style in ASMR speech synthesis. The use of a large language model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction is a significant advancement. The identification of a latent factorization within the token space is a novel insight that enhances the model's ability to generate ASMR speech without requiring whispered training data from the target speaker. The proposed task prompt selection via a virtual speaker pool is also a clever solution to mitigate timbre leakage and ensure speaker identity preservation.
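To make the two-stage design concrete, the following pseudocode-style sketch mirrors the pipeline described above; every interface here (SynthesisRequest, content_style_lm, acoustic_decoder, vocoder and their method names) is invented for illustration and is not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class SynthesisRequest:
    text: str
    style: str        # e.g. "asmr" or "read"
    prompt_wav: str   # path to a short read-style snippet of the target speaker

def synthesize(req: SynthesisRequest, content_style_lm, acoustic_decoder, vocoder):
    # Stage 1: the LLM maps text plus a style tag to discrete speech tokens;
    # per the paper's observation, these tokens carry style but only weak timbre.
    tokens = content_style_lm.generate(text=req.text, style=req.style)
    # Stage 2: the flow-matching decoder reconstructs the target speaker's timbre
    # from the ordinary read-style prompt while keeping the tokens' content and style.
    mel = acoustic_decoder.decode(tokens, timbre_prompt=req.prompt_wav)
    return vocoder(mel)
```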
The experiments are extensive and well-structured, demonstrating the effectiveness of DeepASMR across various tasks, including intra-style and cross-style synthesis. The introduction of DeepASMR-DB, a large bilingual ASMR dataset, is a valuable contribution to the field. The evaluation metrics are comprehensive, combining objective, subjective, and LLM-based assessments, which provide a holistic view of the model's performance. The results indicate that DeepASMR outperforms existing methods in terms of naturalness and style fidelity, particularly in the challenging Normal-to-ASMR task.
The paper provides detailed implementation details, including training configurations and evaluation protocols, which enhance reproducibility. The availability of the DeepASMR-DB dataset and the demo URL further supports the reproducibility of the research. However, the reliance on proprietary models and specific architectures may pose challenges for full replication without access to the same resources.
While the paper presents a significant advancement in ASMR synthesis, it does not address potential limitations in terms of the generalizability of the model across diverse languages and cultural contexts beyond English and Chinese. Additionally, the reliance on a large dataset may limit accessibility for researchers with fewer resources. The trade-offs observed in the iterative inference refinement process, where style intensity may compromise speaker identity, also highlight a potential area for improvement.
The ability to generate high-fidelity ASMR speech has implications for various applications, including mental health, relaxation, and entertainment. This research could lead to advancements in personalized audio experiences, enhancing user engagement and satisfaction. The release of the DeepASMR-DB dataset may also stimulate further research in ASMR synthesis and related fields.
Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the information loss, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that through multi-layer fusion and acoustic feature integration, discrete tokens can close the performance gap with continuous representations in SER tasks.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
This paper systematically investigates the use of discrete tokens for speech emotion recognition, proposing innovative methods to recover lost information through multi-layer fusion and paralinguistic feature integration. The work contributes to advancing SER methodologies, with potential applications in various real-world scenarios.
The paper presents a well-structured methodology that systematically investigates the use of discrete tokens for speech emotion recognition (SER). It employs a fine-tuned WavLM-Large model and explores various layer configurations and k-means quantization granularities. The introduction of attention-based multi-layer fusion and the integration of openSMILE features to recover paralinguistic cues are innovative strategies that enhance the approach. The methodology is robust, with clear steps outlined for feature extraction, quantization, and fusion, although the reliance on k-means clustering could be further justified in terms of its effectiveness compared to other clustering techniques.
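As a rough illustration of this pipeline (hyperparameters such as the codebook size, embedding width, 88 acoustic functionals, and 8 emotion classes are placeholders, not the paper's configuration), discrete tokens from several SSL layers can be embedded, fused with attention, and concatenated with utterance-level acoustic features:

```python
import torch
import torch.nn as nn

class DiscreteTokenSER(nn.Module):
    def __init__(self, n_layers=3, vocab=500, d=256, n_acoustic=88, n_classes=8):
        super().__init__()
        # One embedding table per selected SSL layer's k-means token stream.
        self.embed = nn.ModuleList([nn.Embedding(vocab, d) for _ in range(n_layers)])
        self.layer_attn = nn.Linear(d, 1)                 # scores each layer's pooled representation
        self.classifier = nn.Linear(d + n_acoustic, n_classes)

    def forward(self, tokens, acoustic_feats):
        # tokens: (batch, n_layers, frames) integer ids; acoustic_feats: (batch, n_acoustic)
        pooled = [emb(tokens[:, i]).mean(dim=1) for i, emb in enumerate(self.embed)]
        stacked = torch.stack(pooled, dim=1)              # (batch, n_layers, d)
        weights = torch.softmax(self.layer_attn(stacked), dim=1)
        fused = (weights * stacked).sum(dim=1)            # attention-weighted layer fusion
        # Reintroduce explicit paralinguistic cues (e.g. openSMILE functionals) by concatenation.
        return self.classifier(torch.cat([fused, acoustic_feats], dim=-1))

model = DiscreteTokenSER()
logits = model(torch.randint(0, 500, (2, 3, 120)), torch.randn(2, 88))
print(logits.shape)  # torch.Size([2, 8])
```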
The experiments are comprehensive, utilizing a large-scale dataset (MSP-Podcast) and a well-defined evaluation metric (Macro F1 Score) to assess performance across different configurations. The paper effectively compares the performance of discrete WavLM tokens against various neural audio codecs, providing valuable insights into the effectiveness of the proposed methods. However, the results could benefit from more extensive statistical analysis to validate the significance of the findings.
The paper lacks detailed implementation specifics, such as hyperparameter settings, code availability, and data preprocessing steps, which are crucial for reproducibility. While the methodology is described, the absence of a public repository or demo limits the ability of other researchers to replicate the results.
One limitation is the potential overfitting to the MSP-Podcast dataset, which may not generalize well to other SER tasks or datasets. Additionally, the reliance on discrete tokens may still lead to some information loss that is not fully addressed by the proposed methods. The performance of neural audio codecs is notably lower, which raises questions about their applicability in SER tasks.
The findings of this research have significant implications for the development of efficient SER systems, particularly in applications requiring low-latency processing and storage efficiency, such as mobile devices and real-time communication systems. The integration of paralinguistic features could also enhance emotional intelligence in AI systems, making them more responsive and human-like in interactions.
Audio codecs power discrete music generative modelling, music streaming, and immersive media by shrinking PCM audio to bandwidth-friendly bitrates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram-domain models typically struggle with phase modeling, since the underlying STFT representation is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. As a result, adversarial discriminators must be introduced, at the expense of convergence speed and training stability, to compensate for this inadequate representation of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion, we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model, which reduces the training budget by an order of magnitude, is markedly more compute-efficient while preserving high perceptual quality.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome
The paper presents EuleroDec, a pioneering end-to-end complex-valued RVQ-VAE audio codec that achieves high perceptual quality and robustness without adversarial training. This work significantly advances the field of audio coding by addressing the critical issue of phase information preservation and demonstrates the potential of complex-valued networks in achieving efficient and effective audio compression.
The paper introduces an innovative end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling throughout the entire analysis-quantization-synthesis pipeline. The authors utilize complex-valued neural networks (CVNNs) to avoid the pitfalls of traditional real-valued approaches, which often lead to loss of phase information. The methodology is robust, employing Wirtinger calculus for optimization and complex convolutions, attention mechanisms, and normalization techniques that respect the algebraic structure of the STFT. The removal of adversarial discriminators and diffusion post-filters is a significant methodological advancement, as it enhances training stability and reduces computational overhead while achieving competitive performance.
The experiments are well-structured, benchmarking the proposed codec against established baselines (AudioDec, EnCodec, and APCodec) on the LibriTTS dataset. The evaluation metrics, including SI-SDR, PESQ, STOI, and GDD, comprehensively cover various aspects of audio quality and intelligibility. The results demonstrate that EuleroDec achieves state-of-the-art performance in out-of-domain scenarios, highlighting its robustness and generalization capabilities. The ablation studies provide valuable insights into the contributions of different architectural components, reinforcing the significance of the proposed approach.
The paper provides sufficient details regarding the training setup, including the dataset, optimizer settings, and loss functions used. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. While the methodology is described in detail, the lack of shared resources may hinder other researchers from replicating the findings.
One limitation is the lack of a demo or project URL, which could facilitate further exploration and validation of the proposed codec. Additionally, while the paper claims to achieve state-of-the-art performance, the results are primarily based on a single dataset (LibriTTS), which may not fully represent the codec's performance across diverse audio types and conditions. The reliance on complex-valued architectures may also introduce challenges in terms of implementation and understanding for practitioners accustomed to traditional real-valued models.
The proposed codec has significant implications for audio coding, particularly in applications involving music streaming, generative modeling, and immersive media. By efficiently compressing audio while preserving perceptual quality and phase coherence, EuleroDec could enhance user experiences in various audio applications, including telecommunications and entertainment. The advancements in complex-valued neural networks could also inspire further research in other domains where phase information is crucial.
Generative speech enhancement offers a promising alternative to traditional discriminative methods by modeling the distribution of clean speech conditioned on noisy inputs. Post-training alignment via reinforcement learning (RL) effectively aligns generative models with human preferences and downstream metrics in domains such as natural language processing, but its use in speech enhancement remains limited, especially for online RL. Prior work explores offline methods like Direct Preference Optimization (DPO); online methods such as Group Relative Policy Optimization (GRPO) remain largely uninvestigated. In this paper, we present the first successful integration of online GRPO into a flow-matching speech enhancement framework, enabling efficient post-training alignment to perceptual and task-oriented metrics with few update steps. Unlike prior GRPO work on Large Language Models, we adapt the algorithm to the continuous, time-series nature of speech and to the dynamics of flow-matching generative models. We show that optimizing a single reward yields rapid metric gains but often induces reward hacking that degrades audio fidelity despite higher scores. To mitigate this, we propose a multi-metric reward optimization strategy that balances competing objectives, substantially reducing overfitting and improving overall performance. Our experiments validate online GRPO for speech enhancement and provide practical guidance for RL-based post-training of generative audio models.
Primary: Alibaba Group
All Institutions: Alibaba Group, Tongyi Lab
The main contribution of this paper is the successful application of online GRPO in a flow-matching speech enhancement framework, which offers a novel approach to improving audio quality through reinforcement learning. This work not only advances the state-of-the-art in speech enhancement but also provides practical guidance for future research in generative audio models.
The paper introduces a novel integration of online Group Relative Policy Optimization (GRPO) into a flow-matching speech enhancement framework, which is a significant advancement over traditional methods. The authors adapt GRPO to suit the continuous nature of speech signals, addressing the unique challenges posed by audio data. The proposed multi-metric reward optimization strategy is particularly noteworthy as it mitigates the risk of reward hacking, a common issue in reinforcement learning applications. The methodology is well-structured, with clear definitions of the Markov Decision Process and the transition from ODE to SDE, demonstrating a strong grasp of the underlying mathematical principles.
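A minimal sketch of the core recipe, assuming a simple weighted-sum reward and standard group-wise normalization (the weights and metric names below are illustrative, not the paper's reported settings):

```python
import numpy as np

def multi_metric_reward(metrics, weights):
    """Weighted sum of per-sample metric scores (weights are illustrative placeholders)."""
    return sum(weights[name] * metrics[name] for name in weights)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four enhanced outputs sampled for the same noisy utterance.
weights = {"dnsmos": 0.4, "spk_sim": 0.3, "speech_bertscore": 0.3}
group = [
    {"dnsmos": 3.1, "spk_sim": 0.82, "speech_bertscore": 0.90},
    {"dnsmos": 3.4, "spk_sim": 0.79, "speech_bertscore": 0.88},
    {"dnsmos": 2.9, "spk_sim": 0.85, "speech_bertscore": 0.91},
    {"dnsmos": 3.2, "spk_sim": 0.80, "speech_bertscore": 0.89},
]
advantages = group_relative_advantages([multi_metric_reward(m, weights) for m in group])
print(advantages)  # positive for above-average rollouts, negative for below-average ones
```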
The experiments are comprehensive, utilizing a robust dataset that includes various noise sources and clean speech samples. The results indicate that the proposed method outperforms traditional offline methods like Direct Preference Optimization (DPO) in terms of efficiency and effectiveness, achieving better performance metrics with fewer training steps. The use of multiple evaluation metrics (DNSMOS, speaker similarity, and SpeechBERTScore) provides a well-rounded assessment of the model's performance. The ablation studies further validate the effectiveness of the multi-metric reward approach.
The paper provides detailed implementation specifics, including model architecture, training configurations, and hyperparameter settings. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work could benefit from sharing the code to enhance transparency and facilitate further research in this area.
One limitation noted is the potential for reward hacking when optimizing single metrics, which the authors address through their multi-metric approach. However, the tuning of weights for these metrics appears to require careful consideration, which may complicate the training process. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other speech enhancement scenarios.
The integration of online reinforcement learning in speech enhancement has significant implications for real-time applications, such as voice communication systems and hearing aids. By improving the quality of speech in noisy environments, this research could enhance user experiences in various audio-related technologies. The findings could also inspire further research into the application of RL in other generative modeling tasks beyond speech.
It is well-known that audio classifiers often rely on non-musically relevant features and spurious correlations to classify audio. Hence audio classifiers are easy to manipulate or confuse, resulting in wrong classifications. While inducing a misclassification is not hard, until now the set of features that the classifiers rely on was not well understood. In this paper we introduce a new method that uses causal reasoning to discover features of the frequency space that are sufficient and necessary for a given classification. We describe an implementation of this algorithm in the tool FreqReX and provide experimental results on a number of standard benchmark datasets. Our experiments show that causally sufficient and necessary subsets allow us to manipulate the outputs of the models in a variety of ways by changing the input very slightly. Namely, a change to one out of 240,000 frequencies results in a change in classification 58% of the time, and the change can be so small that it is practically inaudible. These results show that causal analysis is useful for understanding the reasoning process of audio classifiers and can be used to successfully manipulate their outputs.
Primary: King's College London
All Institutions: King's College London
The paper presents a pioneering causal analysis method for audio classifiers, revealing the underlying frequency components critical for classification and demonstrating the potential for manipulating audio inputs to affect model outputs. This research contributes valuable insights into the vulnerabilities of audio classifiers and opens avenues for future work in enhancing model robustness and interpretability.
The paper introduces a novel causal analysis method for audio classifiers, focusing on identifying sufficient, necessary, and complete frequency components of audio signals. The methodology is based on actual causality principles and employs a frequency-based approach that manipulates the Fourier Transform of audio signals. This approach is innovative as it allows for black-box analysis without needing access to the model's internals, which is a significant advancement in understanding model behavior in audio classification.
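As a toy illustration of this kind of black-box frequency probing (not the FreqReX implementation), one can zero a single rFFT bin, resynthesize the waveform, and check whether a classifier's prediction changes:

```python
import numpy as np

def zero_frequency_bin(audio, bin_index):
    """Return the waveform with one rFFT bin removed."""
    spectrum = np.fft.rfft(audio)
    spectrum[bin_index] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

def prediction_flips(audio, bin_index, classify):
    """True if zeroing a single frequency bin changes the black-box prediction."""
    return classify(audio) != classify(zero_frequency_bin(audio, bin_index))

# Measure how small the waveform change is when one bin out of len(audio)//2 + 1 is zeroed.
rng = np.random.default_rng(0)
audio = rng.standard_normal(48_000)   # 3 s at 16 kHz
perturbed = zero_frequency_bin(audio, bin_index=5)
print(np.max(np.abs(perturbed - audio)) / np.max(np.abs(audio)))  # tiny relative change, well under 1%
# prediction_flips(audio, 5, classify=some_model) would test whether such an
# inaudible edit changes the label of an actual (hypothetical) classifier.
```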
The experiments conducted on multiple benchmark datasets demonstrate the effectiveness of the proposed method. The authors show that small perturbations in frequency can lead to significant changes in classification outcomes, highlighting the fragility of audio classifiers. The results are quantitatively supported by statistical measures, such as the success rate of manipulations and the average responsibility of frequencies across different models.
The paper mentions that all code, results, and supplementary materials are available, which is crucial for reproducibility. However, specific URLs for accessing these resources are not provided, which could hinder other researchers from easily replicating the study.
One limitation is the assumption of causal independence among frequencies, which may not hold in real-world audio signals. Additionally, the paper does not address the potential impact of these manipulations on the perceptual quality of audio, which is critical in applications involving human listeners.
The findings have significant implications for the field of audio classification and machine learning, particularly in understanding and mitigating the reliance on spurious features in model predictions. This work could inform the development of more robust audio classifiers and enhance the interpretability of machine learning models in audio applications.
Mamba, a selective state-space model (SSM), has emerged as an efficient alternative to Transformers for speech modeling, enabling long-sequence processing with linear complexity. While effective in speech separation, existing approaches, whether in the time or time-frequency domain, typically decompose the input along a single dimension into short one-dimensional sequences before processing them with Mamba, which restricts Mamba to local 1D modeling and limits its ability to capture global dependencies across the 2D spectrogram. In this work, we propose an efficient omni-directional attention (OA) mechanism built upon unidirectional Mamba, which models global dependencies from ten different directions on the spectrogram. We integrate the proposed mechanism into two baseline separation models and evaluate on three public datasets. Experimental results show that our approach consistently achieves significant performance gains over the baselines while preserving linear complexity, outperforming existing state-of-the-art (SOTA) systems.
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology, Beijing University of Posts and Telecommunications
The main contribution of this work is the development of an omni-directional attention mechanism that significantly improves speech separation performance while maintaining linear computational complexity. This paper presents a compelling advancement in the field of audio processing, particularly in the context of speech separation, by effectively addressing the limitations of existing models and demonstrating substantial performance gains.
The paper introduces an innovative omni-directional attention (OA) mechanism that enhances the Mamba model's capacity to capture global dependencies in speech separation tasks. By projecting the spectrogram into ten different directions, the OA mechanism allows for a more comprehensive representation of the input data. The integration of this mechanism into existing models like TF-GridNet and SPMamba is well-articulated, and the methodology is grounded in solid theoretical foundations. However, the paper could benefit from a more detailed explanation of the specific advantages of the ten-directional approach compared to existing methods.
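To illustrate the general idea of multi-directional scanning (the exact ten directions and their merging are not reproduced here), a 2D time-frequency feature map can be unrolled into several 1D sequences before each is fed to a unidirectional state-space model:

```python
import torch

def directional_scans(x):
    """x: (batch, T, F, d) time-frequency features. Yields (name, sequence) pairs of shape (batch, T*F, d)."""
    b, t, f, d = x.shape
    yield "time-forward",  x.reshape(b, t * f, d)                          # sweep frames left to right
    yield "time-backward", x.flip(dims=[1]).reshape(b, t * f, d)
    yield "freq-forward",  x.transpose(1, 2).reshape(b, f * t, d)          # sweep frequency bins low to high
    yield "freq-backward", x.transpose(1, 2).flip(dims=[1]).reshape(b, f * t, d)
    # Diagonal or zig-zag scans can be added the same way by re-indexing before flattening.

feats = torch.randn(2, 100, 65, 32)    # (batch, frames, frequency bins, channels)
for name, seq in directional_scans(feats):
    print(name, tuple(seq.shape))      # each unidirectional SSM then runs over a (2, 6500, 32) sequence
```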
The experimental setup is thorough, utilizing three well-known datasets (WSJ0-2Mix, WHAM!, and Libri2Mix) to validate the proposed method. The results demonstrate significant performance improvements over baseline models and existing state-of-the-art systems, which supports the claims made in the paper. The use of standard metrics such as SI-SDRi and SDRi provides clarity in evaluating the effectiveness of the proposed approach. However, the lack of a detailed comparison with other recent methods could limit the contextual understanding of the results.
The paper provides sufficient details regarding the experimental setup, including model configurations, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or a demo URL limits the ability for others to directly replicate the results.
While the proposed OA mechanism shows promise, the paper does not address potential limitations, such as the computational overhead of processing ten directional inputs or how the model performs under varying noise conditions beyond those tested. Additionally, the reliance on specific datasets may not fully capture the generalizability of the approach.
The advancements in speech separation through the proposed OA mechanism have significant implications for various applications, including telecommunications, hearing aids, and voice recognition systems. The ability to efficiently separate speech from noise can enhance user experience in real-world scenarios where clarity is paramount.
Automated respiratory sound classification supports the diagnosis of pulmonary diseases. However, many deep models still rely on cycle-level analysis and suffer from patient-specific overfitting. We propose PC-MCL (Patient-Consistent Multi-Cycle Learning) to address these limitations by utilizing three key components: multi-cycle concatenation, a 3-label formulation, and a patient-matching auxiliary task. Our work resolves a multi-label distributional bias in respiratory sound classification, a critical issue inherent to applying multi-cycle concatenation with the conventional 2-label formulation (crackle, wheeze). This bias manifests as a systematic loss of normal signal information when normal and abnormal cycles are combined. Our proposed 3-label formulation (normal, crackle, wheeze) corrects this by preserving information from all constituent cycles in mixed samples. Furthermore, the patient-matching auxiliary task acts as a multi-task regularizer, encouraging the model to learn more robust features and improving generalization. On the ICBHI 2017 benchmark, PC-MCL achieves an ICBHI Score of 65.37%, outperforming existing baselines. Ablation studies confirm that all three components are essential, working synergistically to improve the detection of abnormal respiratory events.
Primary: Seoul National University of Science and Technology
All Institutions: Seoul National University of Science and Technology
The main contribution of this paper is the introduction of PC-MCL, a framework that effectively mitigates multi-label distributional bias and patient-specific overfitting in respiratory sound classification. This work represents a meaningful advancement in the field, combining innovative methodological approaches with rigorous experimental validation to enhance the accuracy and robustness of automated diagnostic systems.
The proposed PC-MCL framework introduces a novel approach to respiratory sound classification by addressing the limitations of single-cycle analysis and patient-specific overfitting. The methodology effectively combines multi-cycle concatenation with a 3-label formulation, allowing for a more comprehensive representation of respiratory sounds. The introduction of a patient-matching auxiliary task is a significant innovation that enhances model generalization and robustness. The ablation studies provide strong evidence for the necessity of each component, demonstrating a well-structured approach to model design.
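A minimal sketch of the 3-label bookkeeping for concatenated cycles (the label encoding below is an assumption, not the paper's exact data pipeline) makes the distributional-bias argument concrete:

```python
import numpy as np

LABELS = ("normal", "crackle", "wheeze")

def cycle_to_3label(has_crackle, has_wheeze):
    normal = not (has_crackle or has_wheeze)
    return np.array([normal, has_crackle, has_wheeze], dtype=np.float32)

def concatenated_target(cycle_labels):
    """Multi-label target of a concatenated sample: element-wise OR (union) over its cycles."""
    return np.clip(np.sum(cycle_labels, axis=0), 0.0, 1.0)

# A normal cycle concatenated with a crackle cycle keeps the 'normal' bit set;
# a 2-label (crackle, wheeze) target would reduce the pair to just "crackle"
# and discard the normal cycle's information.
cycles = [cycle_to_3label(False, False), cycle_to_3label(True, False)]
print({name: float(v) for name, v in zip(LABELS, concatenated_target(cycles))})
# {'normal': 1.0, 'crackle': 1.0, 'wheeze': 0.0}
```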
The experiments are conducted on the ICBHI 2017 dataset, which is a relevant benchmark for respiratory sound classification. The reported results, including an ICBHI Score of 65.37%, indicate a substantial improvement over existing methods. The use of multiple metrics (Specificity, Sensitivity, ICBHI Score) enhances the evaluation's rigor. The ablation studies further validate the contributions of each component, showcasing the effectiveness of the proposed framework.
While the paper outlines the experimental setup and methodology, it lacks detailed implementation specifics that would facilitate reproducibility. There is no mention of code availability or links to a repository, which is a significant drawback for the community's ability to replicate the results.
One limitation is the potential for domain shift between training and inference due to the differing nature of concatenated cycles during training and single cycles during inference. Additionally, the reliance on a specific dataset may limit the generalizability of the findings to other respiratory sound classification tasks.
The proposed framework has significant implications for automated healthcare, particularly in improving the diagnosis of pulmonary diseases through enhanced respiratory sound classification. By addressing critical issues like patient-specific overfitting and label distribution bias, this work could lead to more accurate and reliable diagnostic tools in clinical settings.
Neural text-to-speech (TTS) systems systematically mispronounce low-resource proper nouns, particularly non-English names, brands, and geographic locations, due to their underrepresentation in predominantly English training corpora. Existing solutions typically rely on expensive multilingual data collection, supervised finetuning, or manual phonetic annotation, which limits the deployment of TTS systems in linguistically diverse settings. We introduce SonoEdit, a model editing technique that surgically corrects pronunciation errors in pre-trained TTS models without retraining. Instead of costly finetuning or explicit phoneme injection, we propose a parsimonious alternative based on Null-Space Pronunciation Editing, which performs a single-shot parameter update to modify the pronunciation of specific words while provably preserving all other model behavior. We first adapt Acoustic Causal Tracing to identify the Transformer layers responsible for text-to-pronunciation mapping. We then apply Null-Space Constrained Editing to compute a closed-form weight update that corrects the target pronunciation while remaining mathematically orthogonal to the subspace governing general speech generation. This constrained update steers the model's acoustic output toward a desired pronunciation exemplar while guaranteeing zero first-order change on a preserved speech corpus.
Primary: TU Darmstadt
All Institutions: TU Darmstadt, Smallest AI
The main contribution of this paper is the introduction of SonoEdit, a novel model editing technique that effectively corrects pronunciation errors in TTS systems without the need for retraining, thereby providing a cost-effective solution for improving the accuracy of speech synthesis in linguistically diverse settings. The technical contribution is substantial, offering a unique approach that could significantly influence future research and applications in the field of TTS and machine learning.
The methodology presented in the paper is innovative, focusing on Null-Space Constrained Knowledge Editing to address pronunciation errors in TTS systems. The authors adapt Acoustic Causal Tracing to identify relevant Transformer layers, which is a novel approach to understanding the mapping from text to pronunciation. The use of a single-shot parameter update to correct specific pronunciations while preserving overall model behavior is a significant advancement over traditional methods that require extensive retraining or manual intervention. However, the paper could benefit from a more detailed explanation of the mathematical foundations behind the Null-Space approach to enhance clarity.
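The paper's closed-form update is not reproduced here, but a generic numerical sketch of a null-space-constrained edit on a single linear layer illustrates the guarantee: the weight change is projected so that outputs for a set of preserved activations are unchanged to first order.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_preserved = 64, 32, 20

W = rng.standard_normal((d_out, d_in))            # weight matrix of the edited layer
K = rng.standard_normal((n_preserved, d_in))      # rows: activations whose outputs must be preserved
delta = rng.standard_normal((d_out, d_in))        # raw edit direction (e.g. toward the target pronunciation)

# Projector onto the null space of K: P @ k = 0 for every preserved activation k.
P = np.eye(d_in) - np.linalg.pinv(K) @ K
W_edited = W + delta @ P                          # constrained single-shot update

# Outputs on the preserved activations are unchanged (up to floating-point noise) ...
print(np.max(np.abs(W_edited @ K.T - W @ K.T)))   # effectively zero
# ... while a large fraction of the edit direction survives for other inputs.
print(np.linalg.norm(delta @ P) / np.linalg.norm(delta))
```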
The experimental evaluation appears to be robust, with a clear focus on the effectiveness of the proposed method in correcting pronunciation errors. However, the paper lacks detailed metrics or comparative results against baseline methods, which would strengthen the claims of superiority. The absence of a comprehensive dataset description also limits the reproducibility of results. Including a variety of proper nouns and a diverse set of languages in the experiments would provide a more thorough validation of the method's applicability.
The paper does not provide sufficient details on the implementation or the datasets used, which raises concerns about reproducibility. While the methodology is theoretically sound, practical implementation details such as code availability, hyperparameter settings, and specific dataset characteristics are crucial for other researchers to replicate the results.
One limitation is the potential overfitting of the model to the specific words corrected, which could lead to unintended consequences in other areas of speech generation. Additionally, the method's reliance on identifying specific Transformer layers may limit its applicability to other architectures or TTS systems. The paper also does not address the computational efficiency of the proposed method, which is important for real-time applications.
The broader impact of this research is significant, as it addresses a critical challenge in TTS systems, particularly for low-resource languages and proper nouns. By enabling more accurate pronunciation without extensive retraining, this work has the potential to enhance accessibility and usability of TTS technologies across diverse linguistic contexts. The implications for language preservation and inclusivity in technology are noteworthy.
Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio-text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
Primary: Inner Mongolia University
All Institutions: Inner Mongolia University, Tsinghua Shenzhen International Graduate School, Tsinghua University
The paper presents CORD, a novel framework for bridging the audio-text reasoning gap through weighted on-policy cross-modal distillation, significantly advancing the state of audio-language models. The methodology is innovative, addressing key challenges in cross-modal alignment, and the experimental results validate its effectiveness, marking a meaningful contribution to the field.
The proposed CORD framework introduces a novel approach to cross-modal alignment by leveraging on-policy self-distillation, which is a significant departure from traditional off-policy methods. The multi-granularity alignment strategy, which includes both token-level and sequence-level objectives, is well-justified and addresses critical issues in audio-text reasoning. The importance-aware weighting mechanism for tokens is particularly innovative, allowing the model to focus on semantically critical tokens and early reasoning stages, which are often overlooked in conventional methods. The integration of a judge model for global alignment is an interesting addition that enhances the robustness of the framework.
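A minimal sketch of the token-level term (the position-weighting schedule below is a placeholder, not the paper's importance-aware formula) shows how the audio-conditioned policy is pulled toward its text-conditioned counterpart with a reverse KL:

```python
import torch
import torch.nn.functional as F

def weighted_reverse_kl(audio_logits, text_logits, decay=0.98):
    """audio_logits, text_logits: (seq_len, vocab) for the same rolled-out token positions."""
    log_p_audio = F.log_softmax(audio_logits, dim=-1)    # student: audio-conditioned policy
    log_p_text = F.log_softmax(text_logits, dim=-1)      # internal teacher: text-conditioned policy
    p_audio = log_p_audio.exp()
    # Reverse KL per position: KL(p_audio || p_text) = sum_v p_audio * (log p_audio - log p_text)
    kl_per_token = (p_audio * (log_p_audio - log_p_text)).sum(dim=-1)
    # Importance weights favoring early positions (illustrative decay schedule).
    weights = decay ** torch.arange(kl_per_token.shape[0], dtype=torch.float32)
    return (weights * kl_per_token).sum() / weights.sum()

loss = weighted_reverse_kl(torch.randn(16, 32_000), torch.randn(16, 32_000))
print(loss)   # scalar alignment loss, to be combined with the GRPO sequence-level objective
```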
The experiments conducted across multiple benchmarks demonstrate the effectiveness of CORD in reducing the audio-text performance gap. The results are compelling, showing consistent improvements over baseline methods, including supervised fine-tuning and forward KL divergence. The choice of datasets and the controlled training environment strengthen the validity of the findings. However, the reliance on a single dataset for training may limit the generalizability of the results.
The paper provides a thorough description of the experimental setup, including model configurations and hyperparameters. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should include making the implementation accessible to facilitate further research and validation of the findings.
One limitation of the study is the focus on a single dataset, which may not capture the full diversity of audio-text reasoning tasks. Additionally, while the framework shows promise in bridging the audio-text gap, the long-term stability of the model's performance over extended training periods remains untested. The paper also does not address potential biases in the synthetic audio data generated for training.
The implications of this research are significant for the development of more effective audio-language models, which can enhance applications in various fields such as education, accessibility, and human-computer interaction. By improving the reasoning capabilities of LALMs, this work could lead to more sophisticated audio processing systems that better understand and respond to human language.
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Laboratory of Artificial Intelligence, VUI Labs
The main contribution of this paper is the introduction of DeepASMR, a pioneering framework for zero-shot ASMR speech generation that leverages a two-stage architecture to synthesize high-quality ASMR speech from any speaker's voice without requiring whispered training data. This work significantly advances the state of the art in speech synthesis, particularly in generating nuanced, low-intensity speech styles essential for relaxation.
The methodology is innovative, introducing a two-stage framework that effectively separates content and style in ASMR speech synthesis. The use of a large language model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction is a significant advancement. The identification of a latent factorization within the token space is a novel insight that enhances the model's ability to generate ASMR speech without requiring whispered training data from the target speaker. The proposed task prompt selection via a virtual speaker pool is also a clever solution to mitigate timbre leakage and ensure speaker identity preservation.
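To make the virtual-speaker-pool idea concrete, the sketch below shows one plausible selection rule: pick the pooled prompt whose speaker embedding is least similar to the target speaker, so that style is borrowed without leaking the prompt's timbre. The cosine criterion and the embeddings themselves are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of task-prompt selection from a "virtual speaker pool".
# The goal is to pick a style prompt whose timbre is far from the target
# speaker, mitigating timbre leakage; the least-similar rule is an assumption.
import numpy as np

def select_task_prompt(target_emb: np.ndarray, pool_embs: np.ndarray) -> int:
    """Return the index of the virtual speaker least similar to the target."""
    t = target_emb / np.linalg.norm(target_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    cos_sim = p @ t                  # similarity of each pooled voice to the target timbre
    return int(np.argmin(cos_sim))   # most dissimilar voice -> least timbre leakage

# Usage: idx = select_task_prompt(emb_of_target_speaker, embs_of_virtual_pool)
```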
The experiments are extensive and well-structured, demonstrating the effectiveness of DeepASMR across various tasks, including intra-style and cross-style synthesis. The introduction of DeepASMR-DB, a large bilingual ASMR dataset, is a valuable contribution to the field. The evaluation metrics are comprehensive, combining objective, subjective, and LLM-based assessments, which provide a holistic view of the model's performance. The results indicate that DeepASMR outperforms existing methods in terms of naturalness and style fidelity, particularly in the challenging Normal-to-ASMR task.
The paper provides detailed implementation details, including training configurations and evaluation protocols, which enhance reproducibility. The availability of the DeepASMR-DB dataset and the demo URL further supports the reproducibility of the research. However, the reliance on proprietary models and specific architectures may pose challenges for full replication without access to the same resources.
While the paper presents a significant advancement in ASMR synthesis, it does not address potential limitations in terms of the generalizability of the model across diverse languages and cultural contexts beyond English and Chinese. Additionally, the reliance on a large dataset may limit accessibility for researchers with fewer resources. The trade-offs observed in the iterative inference refinement process, where style intensity may compromise speaker identity, also highlight a potential area for improvement.
The ability to generate high-fidelity ASMR speech has implications for various applications, including mental health, relaxation, and entertainment. This research could lead to advancements in personalized audio experiences, enhancing user engagement and satisfaction. The release of the DeepASMR-DB dataset may also stimulate further research in ASMR synthesis and related fields.
Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception - generic summaries that miss the subtle evidence required for multi-step audio reasoning - while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs an initial single pass on a local 7B Audio-LLM; a cloud controller then gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap via tool-enabled, conditional edge-cloud collaboration under practical system constraints.
Primary: Duke University
All Institutions: Duke University
The main contribution of this paper is the introduction of CoFi-Agent, a novel hybrid architecture that enhances audio reasoning capabilities on edge devices while maintaining efficiency and privacy. This work represents a meaningful advancement in the field of edge AI, particularly for applications requiring nuanced audio understanding.
The methodology presented in this paper is innovative, introducing a hybrid coarse-to-fine architecture that effectively balances local processing and cloud offloading. The approach of using a lightweight local model for initial perception, followed by conditional refinement based on uncertainty detection, is a significant advancement in edge audio systems. The use of on-device tools like temporal re-listening and local ASR for targeted refinement is particularly noteworthy, as it addresses the challenges of latency and privacy in edge deployments. The paper clearly outlines the stages of the CoFi-Agent's operation, providing a structured framework for understanding its functionality.
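The coarse-to-fine gating can be pictured as a confidence-gated escalation loop, sketched below under assumed interfaces: the local model returns an answer with a confidence score, the cloud controller returns a short tool plan only for uncertain cases, and the tool names and threshold are placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of coarse-to-fine gating: answer locally first, escalate to
# the cloud controller only when the local pass looks uncertain.
from typing import Callable, Dict, List

def cofi_answer(audio, question,
                local_model: Callable[[object, str], Dict],
                cloud_plan: Callable[[Dict], List[str]],
                tools: Dict[str, Callable],
                conf_threshold: float = 0.6) -> str:
    first = local_model(audio, question)           # {"answer": str, "confidence": float}
    if first["confidence"] >= conf_threshold:
        return first["answer"]                     # cheap path: stays on-device
    plan = cloud_plan(first)                       # lightweight plan, e.g. ["asr", "relisten:3-8s"]
    evidence = {step: tools[step.split(":")[0]](audio, step) for step in plan}
    refined = local_model(audio, question + f"\nEvidence: {evidence}")
    return refined["answer"]
```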
The experimental evaluation is robust, utilizing the MMAR benchmark with a substantial sample size of 1,000. The results demonstrate a significant improvement in accuracy from 27.20% to 53.60%, showcasing the effectiveness of the proposed architecture. The paper includes a thorough analysis of the accuracy-latency trade-off, which is critical for applications in resource-constrained environments. However, specific details on the experimental setup, such as the exact configurations of the models used, would enhance the transparency of the results.
While the paper provides some implementation details, including the hardware setup and model specifications, it lacks comprehensive information that would facilitate full reproducibility. Key aspects such as hyperparameter settings, data preprocessing steps, and the exact versions of the models used are not sufficiently detailed. Providing a link to a code repository or supplementary materials would greatly enhance reproducibility.
One limitation of the study is the reliance on a single benchmark (MMAR) for evaluation, which may not fully capture the generalizability of the CoFi-Agent across diverse audio tasks. Additionally, the paper acknowledges that false escalations can occur, particularly in low-SNR scenarios, which could impact the overall performance. The potential for privacy concerns, even with symbolic transmission, remains a critical issue that warrants further exploration.
The proposed CoFi-Agent architecture has significant implications for various applications, including smart home devices, security systems, and autonomous robots. By addressing the challenges of latency, bandwidth, and privacy, this work paves the way for more efficient and effective edge audio processing systems. The conditional refinement strategy could inspire future research in other domains where resource constraints are a concern, potentially leading to broader applications in real-time audio analysis and interaction.
Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods substantially degrade the dynamic model's performance in both low- and high-dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation with well-known speech recognition models, including Conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for the high- and no-dropping cases, together with a $33.3\%$ reduction in training time.
Primary: Johannes Kepler University Linz
All Institutions: Johannes Kepler University Linz, University of Trento
The paper presents a novel distillation-based layer dropping framework that significantly enhances dynamic speech networks. It effectively combines knowledge distillation with layer dropping, yielding state-of-the-art performance in automatic speech recognition while addressing critical performance-computation trade-offs for edge devices.
The proposed DLD framework innovatively integrates knowledge distillation with layer dropping to enhance dynamic architectures for speech recognition. The methodology is well-structured, utilizing a gated mechanism for encoder modules and employing Kullback–Leibler divergence for embedding alignment. This approach addresses performance degradation in dynamic models effectively, showcasing a solid understanding of the challenges in deploying ASR models on edge devices.
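As a rough illustration of how distillation and layer dropping can be combined in one training step, the sketch below runs a full-depth forward pass as the teacher view and a randomly gated pass as the student view, adding a KL term between the two outputs; the dropping scheme and loss weighting here are assumptions, not the exact DLD recipe.

```python
# Sketch of one distillation-based layer-dropping training step: the dropped
# path is pulled toward the full-depth path via a KL term on the outputs.
import torch
import torch.nn.functional as F

def dld_step(encoder_layers, x, task_loss_fn, targets, p_drop=0.3, alpha=1.0):
    # Teacher view: full network, no gradient needed for the distillation target.
    with torch.no_grad():
        h_full = x
        for layer in encoder_layers:
            h_full = layer(h_full)

    # Student view: each layer is skipped with probability p_drop (identity gate).
    h_drop = x
    for layer in encoder_layers:
        if torch.rand(()) < p_drop:
            continue                      # layer dropped -> skip connection
        h_drop = layer(h_drop)

    task_loss = task_loss_fn(h_drop, targets)
    kl = F.kl_div(F.log_softmax(h_drop, dim=-1),
                  F.softmax(h_full, dim=-1), reduction="batchmean")
    return task_loss + alpha * kl
```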
The experiments are comprehensive, utilizing two prominent architectures (Conformer and WavLM) and multiple datasets (LibriSpeech and TED-LIUM v3). The results demonstrate significant improvements in word error rates and training efficiency compared to baseline methods, providing strong empirical support for the proposed framework. The use of various dropping probabilities and comparisons with established methods enhances the robustness of the evaluation.
The paper provides sufficient implementation details, including model architectures, training procedures, and hyperparameters. The availability of the code repository further supports reproducibility, allowing other researchers to validate and build upon the findings.
While the paper addresses key challenges in dynamic architectures, it does not explore the scalability of the DLD framework across a wider range of architectures or tasks beyond ASR. Additionally, the reliance on specific datasets may limit the generalizability of the results.
The DLD framework has significant implications for deploying speech recognition models on resource-constrained devices, potentially enhancing accessibility and usability in various applications, including mobile devices and IoT systems. This work could inspire further research into dynamic architectures in other domains of machine learning.
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: https://github.com/dingdongwang/EmotionThinker
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Microsoft Corporation
The main contribution of this paper is the introduction of EmotionThinker, a novel framework that utilizes reinforcement learning for explainable speech emotion reasoning, addressing the limitations of existing SER systems by incorporating prosodic features and providing interpretable explanations. The comprehensive analysis reveals a significant advancement in the interpretability and accuracy of emotion recognition in speech, with implications for various applications in AI and human-computer interaction.
The methodology presented in EmotionThinker is innovative as it reframes speech emotion recognition (SER) from a mere classification task to a deep reasoning problem using reinforcement learning (RL). The introduction of the EmotionCoT-35K dataset, which includes Chain-of-Thought annotations, is a significant contribution that enhances the interpretability of emotion predictions. The development of the prosody-enhanced foundation model and the novel GRPO-PTR algorithm, which incorporates reasoning rewards and trustworthiness weights, demonstrates a thoughtful approach to improving the model's performance and interpretability. However, the complexity of the proposed methods may pose challenges in practical applications.
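The reward shaping in GRPO-PTR can be summarized as combining a rule-based outcome reward with a reasoning reward that is introduced progressively and scaled by a trustworthiness weight; the toy function below illustrates this composition, with the warm-up schedule and trust values chosen purely for exposition.

```python
# Illustrative sketch of trust-weighted, progressively introduced reasoning reward.
def grpo_ptr_reward(outcome_correct: bool,
                    reasoning_score: float,          # in [0, 1], from a reward model
                    reasoning_supports_outcome: bool,
                    step: int, warmup_steps: int = 1000) -> float:
    r_outcome = 1.0 if outcome_correct else 0.0
    progress = min(1.0, step / warmup_steps)              # progressive introduction
    trust = 1.0 if reasoning_supports_outcome else 0.2    # trustworthiness weight
    return r_outcome + progress * trust * reasoning_score
```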
The experiments conducted in the paper are robust, showcasing the performance of EmotionThinker against state-of-the-art models in terms of both emotion accuracy and explanation quality. The use of a well-structured dataset and the clear presentation of results strengthen the paper's claims. However, the paper could benefit from a more detailed analysis of the experimental setup, including hyperparameter choices and training procedures, to allow for better reproducibility.
While the paper provides a project URL that likely contains code and resources, the reproducibility of the results could be enhanced by including more detailed descriptions of the experimental setup, including data preprocessing steps, model training configurations, and evaluation metrics used. The absence of a demo URL also limits the ability to interact with the model directly.
One limitation of the study is the reliance on the prosody-enhanced model, which may not generalize well across different languages or dialects where prosodic features vary significantly. Additionally, the complexity of the GRPO-PTR algorithm may hinder its adoption in real-world applications where simpler models are preferred. The paper also does not address potential biases in the EmotionCoT-35K dataset, which could affect the model's performance.
The proposed EmotionThinker model has the potential to significantly advance the field of speech emotion recognition by providing interpretable and explainable predictions. This can have wide-ranging applications in areas such as human-computer interaction, mental health monitoring, and customer service. By enhancing the understanding of emotional cues in speech, this research could lead to more empathetic and responsive AI systems.
Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model. It introduces a buffer merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings -- scaling up to 20 concurrent users -- and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish, while maintaining low latency and high throughput. While Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay -- around 2.4 s with 5 clients -- and continues to scale effectively up to 20 concurrent clients without degrading transcription quality, while increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.
Primary: Università degli Studi di Milano
All Institutions: Università degli Studi di Milano, VoiSmart Srl
The main contribution of this paper is the development of SWIM, a scalable real-time ASR system that effectively manages multiple concurrent audio streams while maintaining low latency and high accuracy, thereby addressing a critical challenge in the field. The technical contributions and methodology are well-articulated, showcasing significant advancements in ASR technology.
The paper introduces SWIM, a novel architecture that builds upon OpenAI's Whisper model, focusing on model-level parallelization to handle multiple concurrent audio streams. The methodology includes a buffer merging strategy that optimizes resource usage while preserving transcription fidelity. This approach is innovative in the context of real-time ASR systems, especially for multilingual applications, as it does not require modifications to the underlying model. The design choices are well-justified and align with the stated goals of scalability and efficiency.
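A minimal picture of the buffer-merging idea is given below: per-client audio buffers are padded into a single batch so that one batched inference call serves all active clients per cycle. The batching interface is an assumption standing in for a batched Whisper call, not SWIM's actual code.

```python
# Sketch of buffer merging for multi-client streaming ASR.
import numpy as np
from typing import Dict, Callable, List

def merge_and_transcribe(buffers: Dict[str, np.ndarray],
                         transcribe_batch: Callable[[np.ndarray], List[str]]) -> Dict[str, str]:
    client_ids = list(buffers)
    max_len = max(len(buffers[c]) for c in client_ids)
    batch = np.zeros((len(client_ids), max_len), dtype=np.float32)
    for i, c in enumerate(client_ids):
        batch[i, : len(buffers[c])] = buffers[c]      # right-pad shorter buffers
    texts = transcribe_batch(batch)                   # one model call for all clients
    return dict(zip(client_ids, texts))
```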
The evaluation of SWIM is robust, demonstrating its capabilities in multi-client settings with up to 20 concurrent users. The reported results indicate a significant reduction in latency (from 3.4 seconds to 2.4 seconds with 5 clients) while maintaining comparable accuracy to existing systems. The experiments are well-structured, providing a clear comparison with Whisper-Streaming and highlighting the advantages of SWIM in practical scenarios.
The paper lacks detailed implementation specifics, such as code availability or access to datasets used for testing. This omission raises concerns about the reproducibility of the results. While the methodology is sound, the absence of a public demo or project URL limits the ability for other researchers to validate the findings independently.
One limitation noted is the reliance on the Whisper model, which may not generalize to all ASR tasks or languages. Additionally, while the system scales well, the paper does not address potential challenges in environments with highly variable audio quality or background noise, which could impact transcription accuracy.
The advancements presented in this paper have significant implications for real-time ASR applications, particularly in multilingual settings. By enabling efficient handling of multiple audio streams, SWIM could enhance user experiences in voice assistants, live transcription services, and other interactive applications. The work contributes to the ongoing evolution of scalable ASR systems, making them more accessible and effective for diverse user needs.
The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy-M) and around 4,500 multilingual speakers (Tidy-X) from which we derive two distinct conditions. The Tidy-M condition contains target and non-target trials from monolingual speakers across 81 languages. The Tidy-X condition contains target and non-target trials from multilingual speakers in both same- and cross-language trials. We employ two architectures of ResNet models, achieving a 0.35% EER by fine-tuning on our comprehensive Tidy-M partition. Moreover, we show that this fine-tuning enhances the model's generalization, improving performance on unseen conversational interview data from the CANDOR corpus. The complete dataset, evaluation trials, and our models are publicly released to provide a new resource for the community.
Primary: Otto-von-Guericke-University Magdeburg
All Institutions: Otto-von-Guericke-University Magdeburg, University of Zurich
The main contribution of this paper is the introduction of the TidyVoice dataset, a large-scale multilingual resource for speaker verification, which addresses significant gaps in existing datasets. This work is a notable advancement in the field, providing essential resources for developing more inclusive and effective speaker recognition systems.
The paper presents a well-structured methodology for curating the TidyVoice dataset from the Mozilla Common Voice corpus, addressing critical issues of speaker heterogeneity through a verification-based pipeline. The use of ResNet architectures for speaker verification is appropriate, and the fine-tuning process is clearly articulated, demonstrating a systematic approach to enhancing model performance. However, the paper could benefit from more detailed explanations on hyperparameter tuning and the rationale behind the chosen architectures.
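One way such a verification-based cleanup can work is sketched below: embeddings of all utterances sharing a Common Voice client ID are compared against the client centroid, and outliers are flagged as likely belonging to a different speaker. The cosine threshold is an illustrative assumption, not the dataset's published criterion.

```python
# Sketch of per-client-ID speaker-heterogeneity filtering.
import numpy as np

def filter_client_utterances(embs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """embs: (n_utt, dim) speaker embeddings for one client ID.
    Returns a boolean mask of utterances consistent with the dominant speaker."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    centroid = e.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return (e @ centroid) >= threshold
```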
The experimental evaluation is robust, with comprehensive results across multiple benchmarks, including the CANDOR corpus. The authors effectively demonstrate the performance improvements achieved through fine-tuning on the TidyVoice dataset. The use of standard metrics (EER and minDCF) is appropriate, and the results are presented clearly. However, the paper could provide more context on the significance of the improvements in practical applications.
The authors have made a commendable effort to ensure reproducibility by publicly releasing the dataset, evaluation trials, and trained models. The detailed description of the experimental setup, including feature extraction and model training protocols, enhances the reproducibility of their findings. However, the paper could include more information on the specific versions of libraries and frameworks used.
One limitation is the reliance on the Mozilla Common Voice corpus, which may still contain noise and inconsistencies despite the authors' efforts to mitigate speaker heterogeneity. Additionally, the dataset's performance may vary significantly across languages, particularly those with fewer speakers. The paper also does not address potential biases in the dataset that could affect model generalization.
The TidyVoice dataset has the potential to significantly advance the field of multilingual speaker verification, particularly in applications requiring anti-spoofing measures. By providing a large-scale, publicly available resource, the authors contribute to fairer and more robust speaker recognition systems. The implications of this work extend to security applications, personalized user experiences, and accessibility in technology.
Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies: speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that while higher-capacity projectors converge faster, the simple Linear projector achieves higher performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines across both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.
Primary: Indian Institute of Technology Dharwad
All Institutions: Indian Institute of Technology Dharwad, Indian Institute of Technology Jammu
The main contribution of this work is the development of DS2ST-LM, a scalable direct speech-to-speech translation framework that effectively addresses key challenges in the field while demonstrating superior performance across multiple language pairs. The comprehensive methodology, rigorous experimental evaluation, and commitment to reproducibility position this research as a significant advancement in the S2ST domain, with potential applications that extend beyond traditional language barriers.
The proposed methodology introduces a novel single-stage direct speech-to-speech translation (S2ST) framework, DS2ST-LM, which integrates a Whisper speech encoder, a multilingual large language model (LLM), and a timbre-controlled vocoder. The architecture is innovative in its use of semantic token generation strategies, comparing speech-derived and text-derived tokens, and evaluating various projection architectures to optimize performance. The incorporation of a large-scale synthetic dataset, GigaS2S-1000, enhances the model's training capabilities, addressing the scarcity of parallel speech data. The approach is well-structured, systematically addressing key challenges in S2ST, such as semantic-acoustic alignment and speaker identity preservation.
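For intuition, the projection module reduces to a learned map from speech-encoder frames into the LLM embedding space; the sketch below shows the linear variant reported to perform best, with dimensions chosen to match a Whisper-style encoder and a Qwen2-0.5B-sized LLM as illustrative assumptions.

```python
# Sketch of the speech-to-LLM projection (linear variant).
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, enc_dim) from the speech encoder
        return self.proj(speech_feats)   # (batch, frames, llm_dim), fed to the LLM as a prefix

# A higher-capacity alternative would replace `proj` with Conv1d downsampling
# followed by a linear layer, trading parameters for faster convergence.
```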
The experimental evaluation is comprehensive, utilizing multiple datasets and baseline comparisons to assess the performance of DS2ST-LM. The results demonstrate significant improvements over traditional cascaded systems and ST + TTS baselines across various lexical and semantic metrics. The paper provides detailed performance metrics, including BLEU, METEOR, BLEURT, and COMET, which effectively illustrate the model's strengths in translation quality and speaker similarity. The extensive evaluation across multiple language pairs further validates the model's robustness and scalability.
The authors emphasize reproducibility by releasing all training recipes, evaluation pipelines, and model checkpoints. This commitment to transparency is commendable and facilitates further research in the field. However, the paper could benefit from more detailed implementation instructions or a dedicated repository for code and models to enhance accessibility for other researchers.
While the paper presents a strong framework, limitations include the reliance on synthetic data for training, which may not fully capture the nuances of natural speech. Additionally, the performance variations across languages suggest that the model's effectiveness may be contingent on the linguistic diversity and representation in the training data. The study also does not address potential biases in the synthetic data generation process, which could impact the model's generalization to real-world applications.
The proposed DS2ST-LM framework has significant implications for real-time multilingual communication, potentially enhancing accessibility and understanding in diverse linguistic contexts. The ability to preserve speaker identity and produce high-quality speech synthesis could facilitate applications in areas such as international diplomacy, global business, and education. Furthermore, the research contributes to the growing body of work on integrating LLMs in speech processing, paving the way for future advancements in the field.
An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames contribute unequally to the final utterance-level speaker representation for Automatic Speaker Verification systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, where frames with higher uncertainty receive lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic Variance Loss, where the distance between an utterance embedding and its corresponding speaker centroid serves as a pseudo ground truth for uncertainty learning. Second, we incorporate global-level uncertainty supervision by injecting the predicted uncertainty into the softmax scale during training. This adaptive scaling mechanism adjusts the sharpness of the decision boundary according to sample difficulty, providing global guidance. Third, we redesign the uncertainty estimation module by integrating a Transformer encoder with multi-view self-attention, enabling the model to capture rich local and long-range temporal dependencies. Comprehensive experiments demonstrate that U3-xi is model-agnostic and can be seamlessly applied to various speaker encoders. In particular, when applied to ECAPA-TDNN, it achieves 21.1% and 15.57% relative improvements on the VoxCeleb1 test sets in terms of EER and minDCF, respectively.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University
The main contribution of this paper is the U3-xi framework, which enhances speaker recognition by effectively incorporating uncertainty estimation into speaker embeddings, resulting in improved performance and interpretability. This work represents a significant step forward in the field of audio processing and speaker verification, addressing key challenges in the domain with innovative methodologies and rigorous experimental validation.
The proposed U3-xi framework introduces a novel approach to speaker recognition by incorporating uncertainty estimation into the speaker embedding process. The methodology is well-structured, with three main components: speaker-level uncertainty supervision via Stochastic Variance Loss, global-level uncertainty supervision through adaptive scaling in the softmax function, and an advanced uncertainty estimation module utilizing a Transformer encoder with multi-view self-attention. This combination allows the model to effectively weigh frame contributions based on their uncertainty, enhancing the robustness and interpretability of speaker embeddings. The integration of these components demonstrates a thoughtful approach to addressing the limitations of existing methods, particularly the xi-vector framework.
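The core mechanism of uncertainty-weighted aggregation can be sketched with a small pooling module: a head predicts a per-frame log-variance and the frames are averaged with weights that shrink as uncertainty grows. This is only the pooling idea in isolation; the supervision losses and Transformer-based estimator of U3-xi are not reproduced here.

```python
# Sketch of uncertainty-weighted frame pooling.
import torch
import torch.nn as nn

class UncertaintyPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.logvar_head = nn.Linear(dim, 1)        # per-frame uncertainty estimate

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, dim) frame-level representations
        logvar = self.logvar_head(frames)           # (batch, T, 1)
        weights = torch.softmax(-logvar, dim=1)     # higher uncertainty -> lower weight
        return (weights * frames).sum(dim=1)        # (batch, dim) utterance embedding
```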
The experiments conducted on the VoxCeleb1 and VoxCeleb2 datasets provide strong empirical support for the proposed framework. The reported improvements in Equal Error Rate (EER) and minimum Detection Cost Function (minDCF) metrics are significant, showcasing the effectiveness of the U3-xi model over baseline methods. The paper includes comprehensive ablation studies that validate the contributions of each component, thus reinforcing the credibility of the results. However, further exploration of cross-domain performance could enhance the understanding of the model's generalizability.
The paper provides a clear description of the experimental setup, including dataset details, training strategies, and evaluation metrics. However, the absence of a publicly available implementation or code repository limits reproducibility. Providing a GitHub link or similar resource would greatly benefit the community and facilitate further research based on this work.
While the proposed methods show promise, there are limitations in terms of scalability and computational efficiency, particularly with the Transformer encoder's complexity. Additionally, the reliance on specific datasets may restrict the generalizability of the findings to other speaker recognition tasks or languages. The paper could also benefit from a discussion on the potential impact of varying dataset conditions on the performance of the model.
The advancements in speaker recognition through uncertainty modeling have significant implications for biometric authentication, personalized user services, and security applications. The ability to produce more reliable speaker embeddings can enhance the performance of systems in real-world scenarios, where noise and variability are prevalent. Furthermore, the interpretability of uncertainty estimates can lead to more trustworthy AI systems in sensitive applications.
Keyword Spotting (KWS) systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process, involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible by its compact model architecture. A subset of input samples is selected at runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63\% accuracy on clean data and maintaining robust performance (exceeding 94\% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
Primary: University of Kentucky
All Institutions: University of Kentucky
The main contribution of this paper is the development of a novel continual learning framework for keyword spotting that effectively adapts to domain shifts in noisy environments while maintaining computational efficiency. This work significantly advances the field of audio processing by integrating robust denoising techniques with a prototype-based continual learning approach, demonstrating high accuracy and adaptability in real-time applications.
The proposed methodology integrates a dual-input CNN architecture that utilizes both MFCC and Mel-spectrogram features, combined with a multi-stage denoising process. This approach is innovative as it allows for a comprehensive continual learning framework that updates the entire model rather than just specific layers, which is a notable departure from traditional methods. The use of a rehearsal buffer and confidence-driven filtering for effective sample selection enhances the robustness of the model, particularly in dynamic environments. However, the complexity of the model architecture and the denoising techniques may introduce challenges in terms of computational efficiency on resource-constrained devices.
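The runtime selection step can be pictured as a two-test filter, sketched below: a sample is pseudo-labeled by its nearest class prototype and admitted to the retraining pool only if the classifier confidence is high and the prototype distance is small. Both thresholds are illustrative assumptions.

```python
# Sketch of prototype- and confidence-driven sample selection for rehearsal.
import numpy as np

def select_for_rehearsal(feat: np.ndarray, probs: np.ndarray, prototypes: np.ndarray,
                         conf_thr: float = 0.9, dist_thr: float = 1.0):
    """Returns (pseudo_label, accept) for one sample.
    feat: (dim,) embedding; probs: (n_classes,) softmax; prototypes: (n_classes, dim)."""
    pred = int(np.argmax(probs))
    dist = float(np.linalg.norm(feat - prototypes[pred]))
    accept = bool(probs[pred] >= conf_thr) and dist <= dist_thr
    return pred, accept
```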
The experimental evaluation is thorough, utilizing the Google Speech Commands Dataset and the DEMAND dataset for noise simulation. The results demonstrate high accuracy (99.63% on clean data and above 94% in noisy conditions) which supports the effectiveness of the proposed framework. The systematic assessment of performance across different noise levels and environments adds credibility to the findings. However, the paper could benefit from a more detailed analysis of the computational costs associated with the proposed methods.
The paper provides a detailed description of the methodology, including the architecture, training process, and evaluation metrics, which aids in reproducibility. However, the absence of a public repository for the code and models limits the ability for others to fully replicate the results. Including a link to a GitHub repository or similar would enhance reproducibility.
The study primarily focuses on binary classification, which may limit its applicability to broader use cases. Additionally, the reliance on specific datasets may not generalize well to other domains or languages. The computational efficiency on extremely resource-constrained devices remains to be fully validated, as the implementation details may vary across different hardware.
The proposed framework has significant implications for the deployment of keyword spotting systems in real-world applications, especially in resource-constrained environments such as IoT devices. By addressing domain shifts and maintaining performance in noisy conditions, the research contributes to the advancement of continual learning in practical scenarios, potentially enhancing user experiences in voice-activated systems.
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., the TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
Primary: unknown
All Institutions: unknown
The Qwen3-TTS series represents a notable advancement in multilingual and controllable text-to-speech synthesis, with a robust architecture and promising performance metrics. However, the paper would benefit from clearer experimental details and a more thorough discussion of limitations and ethical considerations.
The Qwen3-TTS models utilize a dual-track language model architecture, which is innovative in the context of real-time text-to-speech synthesis. The introduction of two distinct tokenizers—Qwen-TTS-Tokenizer-25Hz and Qwen-TTS-Tokenizer-12Hz—demonstrates a thoughtful approach to addressing both semantic content and latency. The 25Hz tokenizer focuses on semantic fidelity, while the 12Hz tokenizer optimizes for low-latency streaming, showcasing a nuanced understanding of the trade-offs in TTS systems. The methodology is well-structured, though it could benefit from a more detailed explanation of the training process and the specific algorithms employed in the dual-track architecture.
The paper reports extensive experiments that claim state-of-the-art performance across various benchmarks, including multilingual tests and subjective evaluations. However, the specifics of these experiments, such as the datasets used, the evaluation metrics, and the comparison with baseline models, are not elaborated in the abstract. This lack of detail makes it difficult to fully assess the robustness of the results. The claim of "extensive experiments" should be backed by comprehensive tables and figures in the full paper to substantiate the findings.
The paper mentions the release of models and tokenizers under the Apache 2.0 license, which is a positive step towards reproducibility. However, without specific URLs or links to the code repositories or datasets used for training and evaluation, it is challenging to ascertain how easily others can replicate the results. Clear documentation and access to the training data would enhance reproducibility significantly.
The paper does not address potential limitations of the Qwen3-TTS models, such as the challenges of multilingual synthesis, the quality of voice cloning in less-represented languages, or the computational resources required for real-time synthesis. Additionally, the performance metrics should be critically examined, as subjective evaluations can be influenced by various biases.
The advancements presented in the Qwen3-TTS series have significant implications for applications in voice synthesis, accessibility technologies, and content creation. The ability to clone voices and manipulate speech output opens new avenues for personalized applications, but it also raises ethical concerns regarding misuse in deepfake technologies. The release of the models to the community could foster further research and innovation in TTS systems.
Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely-adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, showing its effectiveness in capturing subtle semantic inconsistencies and scaling with the capability of underlying ALLMs.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, National Taiwan University
The main contribution of this paper is the introduction of AQAScore, a novel evaluation framework that enhances the assessment of semantic alignment in text-to-audio generation. The comprehensive analysis highlights the innovative methodology, robust experimental validation, and the potential for broader applications in the field, marking a significant advancement in evaluation metrics for audio generation tasks.
The AQAScore framework introduces a novel approach to evaluating text-to-audio generation by reformulating the evaluation as a probabilistic semantic verification task. This methodology leverages audio-aware large language models (ALLMs) to assess semantic alignment through targeted queries, which is a significant shift from traditional embedding similarity metrics. The backbone-agnostic nature of AQAScore enhances its applicability across different models, making it a versatile tool for researchers in the field.
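The probabilistic verification step reduces to reading the next-token log-probability of "Yes" from the ALLM given the audio and a targeted question; the sketch below shows this with a placeholder model and processor interface (an assumption), where only the log-softmax readout is concrete.

```python
# Sketch of the yes-probability readout behind an AQAScore-style metric.
import torch

def yes_logprob(model, processor, audio, question: str) -> float:
    inputs = processor(text=question, audio=audio, return_tensors="pt")   # assumed ALLM interface
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]           # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    yes_id = processor.tokenizer.convert_tokens_to_ids("Yes")   # assumes a single-token "Yes"
    return float(logprobs[yes_id])

# Usage idea: average yes_logprob over a set of semantic queries derived from
# the text prompt; higher values indicate better audio-text alignment.
```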
The experimental setup is robust, involving multiple benchmarks that include human-rated relevance and pairwise comparisons. The results demonstrate a consistent improvement in correlation with human judgments compared to existing metrics, indicating that AQAScore effectively captures subtle semantic inconsistencies. The thorough evaluation across diverse tasks adds credibility to the findings and showcases the framework's strengths.
While the paper outlines the methodology and experimental setups, it lacks specific implementation details that would facilitate reproducibility. The absence of a publicly available code repository or demo limits the ability for other researchers to validate the findings independently.
The paper acknowledges limitations, such as the potential biases in human ratings and the reliance on the capabilities of underlying ALLMs. Additionally, the framework's performance may vary depending on the specific models used, which could affect generalizability.
AQAScore has the potential to significantly influence the evaluation metrics used in text-to-audio generation, paving the way for more nuanced assessments of semantic alignment. This could lead to advancements in applications such as audio generation for multimedia content, enhancing user experiences in various domains.
The advances in generative AI have enabled the creation of synthetic audio which is perceptually indistinguishable from real, genuine audio. Although this stellar progress enables many positive applications, it also raises risks of misuse, such as for impersonation, disinformation and fraud. Despite a growing number of open-source fake audio detection codes released through numerous challenges and initiatives, most are tailored to specific competitions, datasets or models. A standardized and unified toolkit that supports the fair benchmarking and comparison of competing solutions with not just common databases, protocols, metrics, but also a shared codebase, is missing. To address this, we propose WeDefense, the first open-source toolkit to support both fake audio detection and localization. Beyond model training, WeDefense emphasizes critical yet often overlooked components: flexible input and augmentation, calibration, score fusion, standardized evaluation metrics, and analysis tools for deeper understanding and interpretation. The toolkit is publicly available at https://github.com/zlin0/wedefense with interactive demos for fake audio detection and localization.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Brno University of Technology, National Institute of Informatics, National University of Singapore, University of Rochester, Nanyang Technological University, Nanjing University, WeNet Opensource Community, EURECOM
WeDefense is the first open-source toolkit specifically designed for fake audio detection and localization, providing a comprehensive and modular framework that addresses critical gaps in existing methodologies. The paper's contributions are significant, offering a valuable resource for researchers and practitioners in the rapidly evolving field of audio security.
The methodology presented in WeDefense is comprehensive, addressing both detection and localization of fake audio. The toolkit's design emphasizes modularity and extensibility, allowing for integration of various models and techniques. It incorporates critical aspects such as flexible input handling, calibration, and score fusion, which are often overlooked in existing frameworks. The authors have effectively identified gaps in current methodologies and provided a structured approach to tackle these issues, making it a valuable resource for researchers in the field.
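Two of the components highlighted above, calibration and score fusion, are standard enough to sketch generically: a logistic mapping fitted on development scores and a weighted sum across systems. This is common practice rather than WeDefense's exact implementation.

```python
# Generic sketch of score calibration and linear score fusion for detectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate(dev_scores: np.ndarray, dev_labels: np.ndarray):
    """Fit a monotone mapping from raw detector scores to calibrated log-odds."""
    lr = LogisticRegression()
    lr.fit(dev_scores.reshape(-1, 1), dev_labels)
    return lambda s: lr.decision_function(np.asarray(s).reshape(-1, 1))

def fuse(score_lists, weights=None):
    """Weighted-sum fusion of calibrated scores from several detection systems."""
    scores = np.stack(score_lists)                     # (n_systems, n_trials)
    w = np.ones(len(score_lists)) if weights is None else np.asarray(weights)
    return (w[:, None] * scores).sum(axis=0) / w.sum()
```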
The experimental evaluation is thorough, utilizing well-established datasets like PartialSpoof and ASVspoof5. The paper provides detailed comparisons of various models, including both conventional and self-supervised learning (SSL) models, demonstrating the effectiveness of the proposed toolkit. The results indicate that the WeDefense toolkit achieves competitive performance, particularly with SSL models, which is a significant advancement in the domain of fake audio detection.
The authors have made efforts to ensure reproducibility by providing a clear structure for experiments and detailing configurations in YAML files. The availability of the toolkit on GitHub, along with interactive demos, enhances the reproducibility of the research. However, the paper mentions that the repository will be made available after the review period, which could limit immediate access for validation.
One limitation noted is the modest number of supported models for localization at the time of writing, which may restrict the toolkit's applicability for certain advanced tasks. Additionally, while the toolkit is designed to be extensible, the integration of new models and techniques may require significant effort from users who are not familiar with the underlying architecture.
The WeDefense toolkit has the potential to significantly impact the field of audio processing and security by providing a standardized framework for fake audio detection and localization. Its open-source nature encourages community involvement and could lead to accelerated advancements in the development of robust detection systems, ultimately contributing to the fight against disinformation and fraud in audio content.
Developing algorithms for sound classification, detection, and localization requires large amounts of flexible and realistic audio data, especially when leveraging modern machine learning and beamforming techniques. However, most existing acoustic simulators are tailored for indoor environments and are limited to static sound sources, making them unsuitable for scenarios involving moving sources, moving microphones, or long-distance propagation. This paper presents DynamicSound, an open-source acoustic simulation framework for generating multichannel audio from one or more sound sources that can move continuously in three-dimensional space, recorded by arbitrarily configured microphone arrays. The proposed model explicitly accounts for finite sound propagation delays, Doppler effects, distance-dependent attenuation, air absorption, and first-order reflections from planar surfaces, yielding temporally consistent spatial audio signals. Unlike conventional mono or stereo simulators, the proposed system synthesizes audio for an arbitrary number of virtual microphones, accurately reproducing inter-microphone time delays, level differences, and spectral coloration induced by the environment. Comparative evaluations with existing open-source tools demonstrate that the generated signals preserve high spatial fidelity across varying source positions and acoustic conditions. By enabling the generation of realistic multichannel audio under controlled and repeatable conditions, the proposed open framework provides a flexible and reproducible tool for the development, training, and evaluation of modern spatial audio and sound-source localization algorithms.
Primary: Politecnico di Torino
All Institutions: Politecnico di Torino, University of California
The main contribution of this paper is the introduction of DynamicSound, an open-source acoustic simulation framework that accurately simulates moving sound sources and microphone arrays, addressing limitations in existing simulators. This work significantly enhances the capabilities for generating realistic audio data necessary for developing modern machine learning algorithms in sound classification and localization.
The methodology presented in this paper is robust and well-structured, focusing on simulating sound propagation with moving sources and microphone arrays. The authors incorporate key physical phenomena such as time-of-flight delays, Doppler effects, and air absorption, which are essential for realistic audio simulation. The use of object-oriented programming principles enhances the modularity and extensibility of the framework, allowing for various configurations of sound sources and microphone arrays. However, the paper could benefit from a more detailed discussion on the implementation specifics and potential computational limitations.
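To make the simulated physics concrete, the sketch below renders a moving source to a single static microphone by reading the source signal at its emission times; the time-varying propagation delay produces the Doppler shift and a 1/r term gives distance-dependent attenuation. This is an illustrative approximation in Python (the function name and the reception-time simplification are ours, not the DynamicSound API), and it omits air absorption and reflections.

```python
# Illustrative sketch (not the DynamicSound API): render one moving source to one static
# microphone via a time-varying propagation delay. Resampling the source at the emission
# times implicitly produces the Doppler shift; 1/r spreading gives distance attenuation.
import numpy as np

def render_moving_source(sig, fs, src_pos_fn, mic_pos, c=343.0):
    """sig: mono source signal; src_pos_fn(t) -> (3,) source position in metres at time t."""
    n = len(sig)
    t_rx = np.arange(n) / fs                                # reception times
    pos = np.stack([src_pos_fn(t) for t in t_rx])           # positions at reception time (approximation)
    dist = np.linalg.norm(pos - mic_pos[None, :], axis=1)   # source-microphone distance
    t_tx = t_rx - dist / c                                  # emission times (time of flight)
    # Fractional-delay read of the source signal at the emission times
    out = np.interp(t_tx, np.arange(n) / fs, sig, left=0.0, right=0.0)
    return out / np.maximum(dist, 1e-3)                     # 1/r spreading loss

# Example: a source flying past the microphone at 20 m/s produces an audible Doppler sweep.
fs = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(fs * 2) / fs)
mic = np.array([0.0, 0.0, 0.0])
out = render_moving_source(tone, fs, lambda t: np.array([20.0 * t - 20.0, 2.0, 0.0]), mic)
```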
The experimental evaluation is thorough, comparing the proposed DynamicSound simulator against existing frameworks like pyroomacoustics and pyroadacoustics. The results demonstrate that DynamicSound maintains higher spatial fidelity and temporal consistency, particularly in dynamic scenarios involving moving sources. The experiments are well-designed, showcasing different sound propagation scenarios and effectively illustrating the advantages of the proposed framework. However, the lack of a broader range of environmental conditions in the experiments may limit the generalizability of the findings.
The paper mentions that the DynamicSound framework is open-source and available on GitHub, which is a positive aspect for reproducibility. However, specific implementation details, such as dependencies and setup instructions, are not provided in the text, which could hinder users from replicating the results without additional guidance.
The primary limitations identified include the current inability to model complex acoustic phenomena such as occlusion and diffraction, which may affect the realism of the simulations in more intricate environments. Additionally, the reflection model assumes perfectly reflective surfaces, which may not hold true in real-world applications.
The DynamicSound simulator has significant implications for various applications, including autonomous vehicles, surveillance systems, and sound-source localization algorithms. By providing a flexible and realistic simulation tool, it enables researchers to develop and evaluate machine learning models for audio classification and localization more effectively. The open-source nature of the project promotes collaboration and innovation within the community, potentially leading to advancements in spatial audio processing and related fields.
Catastrophic forgetting remains a major challenge for continual learning (CL) in automatic speech recognition (ASR), where models must adapt to new domains without losing performance on previously learned conditions. Several CL methods have been proposed for ASR, and, recently, weight averaging - where models are averaged in a merging step after fine-tuning - has proven effective as a simple memory-free strategy. However, it is heuristic in nature and ignores the underlying loss landscapes of the tasks, hindering adaptability. In this work, we propose Inverse Hessian Regularization (IHR), a memory-free approach for CL in ASR that incorporates curvature information into the merging step. After fine-tuning on a new task, the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task, ensuring that the model moves primarily in directions less harmful to past performance, while keeping the method lightweight. We evaluate IHR on two CL benchmarks and show that it significantly outperforms state-of-the-art baselines, reducing forgetting while improving adaptability. Ablation studies and analyses further confirm its effectiveness.
Primary: Department Electrical Engineering ESAT-PSI
All Institutions: Research Foundation Flanders (FWO), Department Electrical Engineering ESAT-PSI
This paper introduces Inverse Hessian Regularization (IHR), a novel memory-free method for continual learning in automatic speech recognition, which effectively reduces forgetting while maintaining strong adaptability to new tasks. The comprehensive evaluation of IHR demonstrates its potential to significantly advance the field of continual learning in ASR, addressing critical challenges faced by existing methods.
The proposed Inverse Hessian Regularization (IHR) method innovatively incorporates curvature information into the continual learning process for ASR. By using a Kronecker-factored approximation of the inverse Hessian, the method effectively adjusts the model updates to minimize catastrophic forgetting while maintaining adaptability to new tasks. The lightweight nature of the approach, requiring only a single application of the inverse Hessian per task, enhances its practicality. The methodology is well-structured and addresses key limitations of existing methods, such as weight averaging.
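The curvature-adjusted merge can be pictured per layer as projecting the fine-tuning update through a Kronecker-factored inverse Hessian of the previous task. The sketch below shows that idea for one linear layer under standard K-FAC conventions; the exact factors, damping, and scaling used by IHR may differ from this illustration.

```python
# Minimal K-FAC-style sketch of a curvature-adjusted merge for one linear layer.
import torch

def kfac_adjusted_merge(w_old, w_new, A, G, damping=1e-3, alpha=1.0):
    """
    w_old, w_new : (out, in) weights before/after fine-tuning on the new task
    A            : (in, in)   input-activation covariance from the previous task
    G            : (out, out) pre-activation-gradient covariance from the previous task
    """
    delta = w_new - w_old
    A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0], dtype=A.dtype, device=A.device))
    G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0], dtype=G.dtype, device=G.device))
    # Directions with high curvature on the old task (large A, G eigenvalues) are shrunk,
    # while flat directions pass through largely unchanged before the merge.
    adjusted = G_inv @ delta @ A_inv
    return w_old + alpha * adjusted
```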
The experiments are robust, utilizing two distinct continual learning benchmarks that reflect real-world challenges in ASR. The results demonstrate significant improvements over state-of-the-art baselines, with detailed metrics including Word Error Rate (WER) and Backward Transfer (BWT) providing a comprehensive view of performance. The ablation studies further validate the effectiveness of the IHR approach, confirming the importance of the inverse Hessian adjustment in reducing forgetting and improving adaptability.
The paper provides sufficient details regarding the experimental setup, including model architecture, training protocols, and hyperparameter settings. The availability of a GitHub repository for code and detailed results enhances reproducibility, allowing other researchers to validate and build upon the findings.
While the method shows promise, it relies on approximations of the inverse Hessian, which may introduce inaccuracies. Additionally, the approach is primarily tested on specific ASR tasks, and its generalizability to other domains or more complex scenarios remains to be fully explored.
The implications of this work are significant for the development of adaptive ASR systems, particularly in dynamic environments where models must continuously learn from new data without forgetting previous knowledge. The lightweight and memory-free nature of IHR could facilitate the deployment of ASR systems in resource-constrained settings, enhancing their usability and effectiveness across diverse applications.
Adapting speech enhancement (SE) models to unseen environments is crucial for practical deployments, yet test-time adaptation (TTA) for SE remains largely under-explored due to a lack of understanding of how SE models degrade under domain shifts. We observe that mask-based SE models lose confidence under domain shifts, with predicted masks becoming flattened and losing decisive speech preservation and noise suppression. Based on this insight, we propose mask polarization (MPol), a lightweight TTA method that restores mask bimodality through distribution comparison using the Wasserstein distance. MPol requires no additional parameters beyond the trained model, making it suitable for resource-constrained edge deployments. Experimental results across diverse domain shifts and architectures demonstrate that MPol achieves very consistent gains that are competitive with significantly more complex approaches.
Primary: Institute of Signal Processing and System Theory, University of Stuttgart
All Institutions: Institute of Signal Processing and System Theory, University of Stuttgart
The paper presents MPol, a novel lightweight TTA method for speech enhancement that effectively restores mask bimodality under domain shifts. This contribution is significant as it bridges theoretical insights from classification adaptation to practical applications in audio processing, demonstrating both innovation and technical rigor in addressing a critical challenge in the field.
The proposed methodology, mask polarization (MPol), is innovative in its approach to restoring bimodal characteristics of masks in speech enhancement models during test-time adaptation (TTA). By leveraging the Wasserstein distance to compare distributions, MPol effectively addresses the degradation of model confidence under domain shifts without introducing additional parameters. This lightweight approach is particularly relevant for resource-constrained environments, making it a practical solution for real-world applications. The empirical investigation into the loss of bimodality in masks under domain shifts provides a solid foundation for the proposed method, which is well-justified and theoretically sound.
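As a rough illustration of the mechanism (not the authors' exact objective), a mask-polarization style loss can be written as the closed-form 1-D Wasserstein-1 distance between the empirical distribution of predicted mask values and an assumed bimodal reference; the reference levels and the choice of adapted parameters below are illustrative assumptions.

```python
# Illustrative sketch of a mask-polarization style TTA objective.
import torch

def wasserstein1_1d(x, y):
    # For equal-size empirical samples, W1 is the mean |difference| of sorted values.
    return (torch.sort(x.flatten()).values - torch.sort(y.flatten()).values).abs().mean()

def mask_polarization_loss(pred_mask, low=0.1, high=0.9):
    # Hypothetical bimodal reference: half the mass near "suppress", half near "keep".
    n = pred_mask.numel()
    ref = torch.cat([torch.full((n // 2,), low), torch.full((n - n // 2,), high)])
    return wasserstein1_1d(pred_mask, ref.to(pred_mask.device))

# At test time, a few gradient steps on this loss (e.g. on normalization/affine parameters)
# would sharpen flattened masks without adding any parameters to the trained model.
```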
The experiments are comprehensive, utilizing a variety of datasets that cover a broad range of domain shifts. The performance metrics employed, including both perceptual and signal-level evaluations, provide a well-rounded assessment of MPol's effectiveness. The results indicate that MPol achieves competitive performance compared to more complex methods, demonstrating its robustness across different architectures. The ablation study further strengthens the findings by highlighting the importance of weight ensembling, which contributes to the stability of the adaptation process.
The paper provides sufficient details regarding the experimental setup, including the datasets used, model architectures, and evaluation metrics. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider making the implementation accessible to facilitate validation and further exploration by the research community.
One limitation of the study is the focus on additive noise, leaving other types of corruptions unaddressed. Additionally, while the method shows promise, it may not generalize to all types of speech enhancement tasks or environments, particularly those that involve more complex noise scenarios. The reliance on empirical distribution comparisons may also introduce variability depending on the characteristics of the target domain.
The findings of this research have significant implications for the deployment of speech enhancement systems in real-world applications, particularly in resource-constrained environments such as mobile devices or edge computing scenarios. By enabling effective adaptation to unseen environments without the need for extensive additional resources, MPol could enhance the accessibility and usability of speech enhancement technologies across various domains, including telecommunications, assistive technologies, and hearing aids.
Recent advances in text-to-music generation (TTM) have yielded high-quality results, but often at the cost of extensive compute and the use of large proprietary internal data. To improve the affordability and openness of TTM training, an open-source generative model backbone that is more training- and data-efficient is needed. In this paper, we constrain the number of trainable parameters in the generative model to match that of the MusicGen-small benchmark (with about 300M parameters), and replace its Transformer backbone with the emerging class of state-space models (SSMs). Specifically, we explore different SSM variants for sequence modeling, and compare a single-stage SSM-based design with a decomposable two-stage SSM/diffusion hybrid design. All proposed models are trained from scratch on a purely public dataset comprising 457 hours of CC-licensed music, ensuring full openness. Our experimental findings are three-fold. First, we show that SSMs exhibit superior training efficiency compared to the Transformer counterpart. Second, despite using only 9% of the FLOPs and 2% of the training data size compared to the MusicGen-small benchmark, our model achieves competitive performance in both objective metrics and subjective listening tests based on MusicCaps captions. Finally, our scaling-down experiment demonstrates that SSMs can maintain competitive performance relative to the Transformer baseline even at the same training budget (measured in iterations) when the model is made four times smaller. To facilitate the democratization of TTM research, the processed captions, model checkpoints, and source code are available on GitHub via the project page: https://lonian6.github.io/ssmttm/.
Primary: National Taiwan University
All Institutions: National Taiwan University
The paper presents a compelling exploration of state-space models for text-to-music generation, demonstrating their potential to reduce training costs while maintaining performance. The methodology is innovative, and the results contribute meaningfully to the field, particularly in promoting openness and accessibility in machine learning research.
The paper introduces a novel approach to text-to-music generation by leveraging state-space models (SSMs) instead of traditional Transformer architectures. The authors explore various SSM configurations, including a single-stage and a two-stage SSM/diffusion hybrid design, which significantly reduces the training resource requirements while maintaining competitive performance. The methodology is well-structured, with clear research questions and a systematic exploration of different architectures, making a strong case for the efficiency of SSMs in this domain.
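For readers unfamiliar with SSM backbones, the core computation is a linear state recurrence rather than self-attention. The sketch below writes the diagonal recurrence as an explicit loop for clarity; practical variants such as S4- or Mamba-style layers use different parameterizations and parallel scans or convolutions.

```python
# Minimal sketch of the linear recurrence at the core of a (diagonal) state-space layer.
import torch

def diagonal_ssm(u, A_diag, B, C):
    """
    u      : (T, d_in) input sequence
    A_diag : (d_state,) diagonal state transition (|A| < 1 for stability)
    B      : (d_state, d_in), C : (d_out, d_state)
    """
    T = u.shape[0]
    x = torch.zeros(A_diag.shape[0])
    ys = []
    for t in range(T):
        x = A_diag * x + B @ u[t]       # x_t = A x_{t-1} + B u_t
        ys.append(C @ x)                # y_t = C x_t
    return torch.stack(ys)
```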
The experiments are comprehensive and well-executed, comparing the proposed models against a baseline Transformer model and the official MusicGen-small. The use of both objective metrics (e.g., Fréchet Distance, KL divergence, CLAP scores) and subjective listening tests provides a robust evaluation of the models' performance. The results demonstrate that the SSM-based models achieve competitive audio generation quality with significantly lower training costs, validating the authors' claims about the efficiency of their approach.
The authors emphasize openness by using publicly available datasets and providing access to their model checkpoints and source code. However, while the paper details the training setup and evaluation protocols, the reproducibility could be further enhanced by providing explicit instructions for replicating the experiments, including hyperparameter settings and specific configurations used during training.
One limitation is the reliance on a relatively small dataset (457 hours of CC-licensed music) compared to the 20K hours used for the MusicGen-small benchmark. This may affect the generalizability of the findings. Additionally, while the models perform well under constrained training budgets, the paper does not extensively explore the performance of the SSMs in more complex or diverse music generation tasks.
The findings have significant implications for democratizing text-to-music generation research, as they suggest that high-quality models can be trained with fewer resources. This could enable a wider range of researchers and developers to explore TTM applications, fostering innovation in music generation and related fields.
Deploying speaker verification on resource-constrained devices remains challenging due to the computational cost of high-capacity models; knowledge distillation (KD) offers a remedy. Classical KD entangles target confidence with non-target structure in a Kullback-Leibler term, limiting the transfer of relational information. Decoupled KD separates these signals into target and non-target terms, yet treats non-targets uniformly and remains vulnerable to the long tail of low-probability classes in large-class settings. We introduce Triage KD (TRKD), a distillation scheme that operationalizes assess-prioritize-focus. TRKD introduces a cumulative-probability cutoff $\tau$ to assess per-example difficulty and partition the teacher posterior into three groups: the target class, a high-probability non-target confusion-set, and a background-set. To prioritize informative signals, TRKD distills the confusion-set conditional distribution and discards the background. Concurrently, it transfers a three-mass (target/confusion/background) distribution that captures sample difficulty and inter-class confusion. Finally, TRKD focuses learning via a curriculum on $\tau$: training begins with a larger $\tau$ to convey broad non-target context, then $\tau$ is progressively decreased to shrink the confusion-set, concentrating supervision on the most confusable classes. In extensive experiments on VoxCeleb1 with both homogeneous and heterogeneous teacher-student pairs, TRKD was consistently superior to recent KD variants and attained the lowest EER across all protocols.
Primary: AI Solution Team
All Institutions: AI Solution Team
The main contribution of this paper is the introduction of Triage KD (TRKD), a novel knowledge distillation method that enhances speaker verification performance by effectively managing class confusion and optimizing the learning process through a structured curriculum. This approach not only improves model accuracy but also addresses the practical challenges of deploying AI models in resource-limited environments, marking a significant advancement in the field of audio processing and machine learning.
The proposed Triage KD (TRKD) method introduces a novel approach to knowledge distillation by operationalizing the assess-prioritize-focus principle. It effectively partitions the teacher's posterior into three distinct groups, allowing for a more targeted and informative distillation process. The methodology is well-structured, leveraging a cumulative-probability cutoff to manage the complexity of the non-target classes, which is particularly relevant in large-class settings. The curriculum learning aspect is a strong addition, enabling a gradual refinement of the model's focus on the most confusable classes.
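The assess/partition step can be sketched as follows: rank the non-target classes by teacher probability, keep the smallest head whose cumulative mass reaches $\tau$ as the confusion-set, and lump the remainder into the background. The weighting and exact distillation terms in TRKD may differ from this illustration.

```python
# Sketch of the triage partition driven by a cumulative-probability cutoff tau.
import torch
import torch.nn.functional as F

def triage_partition(teacher_logits, target, tau=0.9):
    p = F.softmax(teacher_logits, dim=-1)                     # teacher posterior (C,)
    nt = torch.tensor([c for c in range(p.numel()) if c != target])
    order = nt[torch.argsort(p[nt], descending=True)]         # non-targets, most probable first
    cum = torch.cumsum(p[order], dim=0) / p[order].sum()
    k = int(torch.searchsorted(cum, torch.tensor(tau)).item()) + 1
    confusion, background = order[:k], order[k:]               # keep the head, discard the long tail
    three_mass = torch.stack([p[target], p[confusion].sum(), p[background].sum()])
    return confusion, background, three_mass
```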
The experiments conducted on the VoxCeleb1 dataset are extensive and well-designed, covering both homogeneous and heterogeneous teacher-student pairs. The results demonstrate significant improvements in equal error rates (EER) across various protocols, consistently outperforming existing knowledge distillation methods. The ablation studies further validate the effectiveness of the proposed method, highlighting the importance of the three-mass partitioning and the cumulative-probability cutoff.
The paper provides detailed implementation specifics, including model architectures, training configurations, and hyperparameters, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease of replication for external researchers.
While TRKD shows promising results, the paper does not address potential scalability issues when applied to even larger datasets or more complex models. Additionally, the method's performance in real-world scenarios outside of the controlled experimental setup remains to be evaluated.
The implications of TRKD extend beyond speaker verification, as the principles of selective supervision and difficulty-aware curricula could be beneficial in various domains of machine learning, including computer vision and natural language processing. The ability to deploy efficient models on resource-constrained devices is particularly relevant in today's AI landscape, where edge computing is becoming increasingly important.
We present VCNAC, a variable channel neural audio codec. Our approach features a single encoder and decoder parametrization that enables native inference for different channel setups, from mono speech to cinematic 5.1 channel surround audio. Channel compatibility objectives ensure that multi-channel content maintains perceptual quality when decoded to fewer channels. The shared representation enables training of generative language models on a single set of codebooks while supporting inference-time scalability across modalities and channel configurations. Evaluation using objective spatial audio metrics and subjective listening tests demonstrates that our unified approach maintains high reconstruction quality across mono, stereo, and surround audio configurations.
Primary: Amazon
All Institutions: Amazon
The main contribution of this paper is the introduction of VCNAC, a novel variable-channel neural audio codec that efficiently processes mono, stereo, and surround audio within a unified architecture, achieving high-quality reconstruction at lower bitrates than existing state-of-the-art codecs. This work represents a significant advancement in the field of neural audio codecs, addressing key limitations of fixed-channel architectures and enhancing the scalability of audio processing applications.
The proposed methodology in VCNAC is innovative, leveraging a variable-channel architecture that allows for dynamic processing of audio across different channel configurations (mono, stereo, and surround) within a single encoder-decoder framework. The use of shared codebooks and cross-channel attention mechanisms is particularly noteworthy, as it enables efficient information exchange and preserves spatial audio characteristics. The architecture is well-structured, with a clear separation of channel-specific processing and a robust fusion and splitting strategy that enhances performance while reducing computational overhead.
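A minimal sketch of the cross-channel attention idea is given below (the module name and tensor layout are assumptions, not the VCNAC implementation): because the channel axis is treated as the attention sequence, the same weights serve mono, stereo, and 5.1 inputs.

```python
# Illustrative cross-channel attention over codec latents.
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z):                                   # z: (batch, channels, time, dim)
        b, c, t, d = z.shape
        x = z.permute(0, 2, 1, 3).reshape(b * t, c, d)      # channels become the sequence axis
        out, _ = self.attn(x, x, x)                         # every channel attends to all channels
        return out.reshape(b, t, c, d).permute(0, 2, 1, 3)

# The channel-axis length is free, so one set of weights handles 1, 2, or 6 channel inputs.
z = torch.randn(2, 6, 50, 128)                              # e.g. 5.1 surround latents
y = CrossChannelAttention(128)(z)
```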
The experimental evaluation is thorough, utilizing both objective metrics (SI-SNR, PESQ, and spatial-specific measures) and subjective assessments (MUSHRA study) to validate the codec's performance across various audio modalities. The results demonstrate that VCNAC outperforms existing codecs in terms of reconstruction quality and bitrate efficiency, particularly in the context of surround audio. However, the reliance on synthetic training data for surround configurations could affect the generalizability of the results.
The paper provides sufficient implementation details, including architecture specifics, training data, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One significant limitation is the use of simulated surround audio data for training, which may not fully capture the complexities of real-world audio environments. Additionally, the performance metrics for rear channels indicate potential weaknesses in handling low loudness levels, suggesting that the model may struggle with certain audio characteristics.
The development of VCNAC has the potential to significantly impact the field of audio processing, particularly in applications requiring efficient multi-channel audio encoding and decoding. Its ability to operate at lower bitrates while maintaining high perceptual quality could lead to advancements in streaming services, gaming, and virtual reality applications, where audio fidelity is crucial.
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting their effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciations and generating actionable feedback. To improve performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in terms of both objective and human evaluation, highlighting the value of the proposed dataset.
Primary: unknown
All Institutions: unknown
The paper presents a significant advancement in the field of interactive language learning by introducing a novel dataset and methodology for improving pronunciation training through audio-language models. The comprehensive evaluation of models on this dataset highlights the potential of ALMs to provide actionable feedback, addressing a critical gap in existing systems.
The paper introduces a novel dataset, L2-Arctic-plus, specifically designed for chat-based pronunciation training, which includes detailed annotations for mispronunciation errors and actionable feedback. The methodology of instruction-tuning audio-language models (ALMs) on this dataset is well-structured, addressing the limitations of existing models in providing intuitive feedback. The use of a cascaded ASR+LLM framework and the exploration of existing ALMs are also noteworthy, although the paper could benefit from a more detailed description of the instruction-tuning process.
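For readers unfamiliar with the cascaded baseline pattern, a hypothetical sketch is shown below; the model objects, prompt wording, and helper methods are placeholders rather than the paper's setup.

```python
# Hypothetical sketch of the cascaded ASR+LLM pattern: transcribe the learner's audio,
# then ask an LLM to flag mispronunciations and suggest concrete fixes given the target
# text. `asr_model` and `llm` are placeholder objects, not a specific library API.
def cascaded_capt_feedback(audio, target_text, asr_model, llm):
    hypothesis = asr_model.transcribe(audio)            # e.g. an off-the-shelf ASR system
    prompt = (
        "Target sentence: " + target_text + "\n"
        "Recognised speech: " + hypothesis + "\n"
        "List likely mispronounced words and give one actionable tip for each."
    )
    return llm.generate(prompt)                         # free-form, learner-friendly feedback
```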
The experiments are comprehensive, benchmarking various models on the newly introduced dataset. The results demonstrate significant improvements in mispronunciation detection and feedback generation, particularly with the instruction-tuned ALMs. The evaluation metrics used are appropriate, including both objective measures and human evaluations, which lend credibility to the findings.
The paper provides sufficient implementation details, including the models used and the training process. However, it lacks specific hyperparameters and configurations that could enhance reproducibility. The code is publicly available, which is a positive aspect for potential replicators.
The paper acknowledges limitations, such as the focus on "reading-aloud" scenarios and the lack of auditory feedback. Additionally, the performance of the models, while improved, is still described as suboptimal, indicating room for further research and refinement.
The advancements presented in this paper have the potential to significantly improve language learning tools, making them more effective and user-friendly. The integration of ALMs into educational contexts could enhance learner engagement and outcomes in pronunciation training.
Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
Primary: AI Solution Team
All Institutions: AI Solution Team
The paper presents DAME, a novel framework that enhances speaker verification by aligning embedding capacity with utterance duration, significantly improving performance on short utterances. This contribution is particularly relevant in the context of real-world applications where short speech segments are common, addressing a critical limitation in existing speaker verification methodologies.
The proposed DAME framework introduces a novel approach to speaker verification by creating a nested hierarchy of sub-embeddings that are aligned with utterance durations. This model-agnostic strategy allows for dynamic adjustment of embedding sizes based on the length of the input, addressing a significant gap in existing methods that utilize fixed-dimensional representations. The methodology is well-structured, leveraging duration-aware supervision to enhance the discriminative power of embeddings for both short and long utterances. The use of prefix embeddings and a multi-prefix loss function demonstrates a thoughtful integration of existing concepts in representation learning while innovatively adapting them to the specific challenges of speaker verification.
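The nested-prefix idea can be sketched with a Matryoshka-style multi-prefix objective, shown below; the prefix dimensions, classifier heads, and the way DAME pairs utterance durations with prefix sizes are simplified assumptions here.

```python
# Sketch of a multi-prefix (Matryoshka-style) loss over nested sub-embeddings.
import torch
import torch.nn.functional as F

def multi_prefix_loss(emb, labels, classifiers, prefix_dims=(64, 128, 256)):
    """
    emb         : (batch, 256) full speaker embeddings
    classifiers : dict mapping prefix dim -> nn.Linear(prefix_dim, num_speakers)
    """
    loss = 0.0
    for d in prefix_dims:
        prefix = F.normalize(emb[:, :d], dim=-1)          # nested sub-embedding
        loss = loss + F.cross_entropy(classifiers[d](prefix), labels)
    return loss / len(prefix_dims)
```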
The experiments are comprehensive, utilizing multiple datasets (VoxCeleb1-O/E/H and VOiCES) to evaluate the performance of DAME across various speaker encoder architectures. The results show consistent improvements in equal error rates (EER) for short-duration trials, which is a critical metric in speaker verification. The paper effectively compares DAME against conventional methods, including large-margin fine-tuning, and demonstrates its robustness across different conditions. However, the paper could benefit from more extensive ablation studies to further elucidate the contributions of individual components of the DAME framework.
The paper provides sufficient details regarding the training regime, model configurations, and evaluation protocols, which would allow other researchers to replicate the experiments. However, the absence of a publicly available code repository or demo URL limits the ease of reproducibility. Including such resources would significantly enhance the paper's impact and facilitate further research in this area.
While the DAME framework shows promise, it may still face challenges in extreme real-world scenarios where utterance quality is poor or heavily distorted. Additionally, the reliance on specific datasets may limit the generalizability of the findings. The paper does not address potential biases in the datasets used, which could affect the model's performance across diverse populations.
The advancements presented in this paper have significant implications for applications in security, telecommunications, and human-computer interaction, where robust speaker verification is essential. By improving the performance of speaker verification systems on short utterances, DAME could enhance user experience in voice-activated systems and contribute to more secure authentication methods.
Recent advances in audio-language models have demonstrated remarkable success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require robust models capable of processing and reasoning over long-form audio. In this work, we present LongSpeech, a large-scale and scalable benchmark specifically designed to evaluate and advance the capabilities of speech models on long-duration audio. LongSpeech comprises over 100,000 speech segments, each approximately 10 minutes long, with rich annotations for ASR, speech translation, summarization, language detection, speaker counting, content separation, and question answering. We introduce a reproducible pipeline for constructing long-form speech benchmarks from diverse sources, enabling future extensions. Our initial experiments with state-of-the-art models reveal significant performance gaps, with models often specializing in one task at the expense of others and struggling with higher-level reasoning. These findings underscore the challenging nature of our benchmark. Our benchmark will be made publicly available to the research community.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Alibaba International Digital Commerce
The main contribution of this work is the introduction of LongSpeech, a large-scale benchmark designed to evaluate long-form speech processing capabilities, which addresses significant gaps in current audio-language models. The paper's comprehensive methodology and experimental evaluation underscore its potential to catalyze advancements in the field, although improvements in reproducibility and accessibility would enhance its impact further.
The methodology presented in the paper is robust, with a clear focus on constructing a large-scale benchmark for long-form speech processing. The authors detail their approach to dataset construction, including the integration of diverse data sources and the careful curation of audio segments. The multi-task nature of the benchmark is a significant strength, addressing various aspects of speech understanding, including ASR, translation, summarization, and more. The reproducibility of the dataset construction process is emphasized, which is critical for future research.
The experimental evaluation is thorough, utilizing multiple state-of-the-art models to benchmark performance across the various tasks. The results highlight significant performance gaps in current models, particularly in higher-level reasoning tasks. The use of established evaluation metrics (e.g., WER, BLEU, ROUGE) adds credibility to the findings, although the paper could benefit from a more detailed discussion of the experimental setup and model configurations.
While the authors mention a reproducible pipeline for constructing the benchmark, the paper lacks specific implementation details or links to code repositories that would facilitate direct replication of the results. This is a notable gap, as reproducibility is crucial for the validation of research findings.
The paper acknowledges limitations in the current models' performance, particularly in tasks requiring deep semantic understanding and reasoning over long audio contexts. However, it does not address potential biases in the dataset or the limitations of the models used in the experiments. Additionally, the lack of a publicly accessible demo or project URL diminishes the immediate applicability of the benchmark.
The LongSpeech benchmark has the potential to significantly advance the field of long-form speech processing, addressing a critical gap between existing benchmarks and real-world applications. By providing a comprehensive evaluation framework, it encourages the development of more robust audio-language models capable of handling complex, real-world audio scenarios. This could have implications in various domains, including meeting transcription, spoken document understanding, and conversational AI.
Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be used with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement) compared to the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.
Primary: Institute for Infocomm Research
All Institutions: Institute for Infocomm Research, Nanyang Technological University, The Hong Kong Polytechnic University
The main contribution of this paper is the development of Stream-Voice-Anon, a novel real-time speaker anonymization system that effectively combines neural audio codecs and language models to enhance privacy while preserving speech intelligibility and emotional content. This work represents a significant advancement in the field of audio processing and privacy-preserving technologies, addressing a critical need in online voice applications.
The proposed methodology of Stream-Voice-Anon is innovative in its integration of neural audio codecs with causal language models for real-time speaker anonymization. The use of pseudo-speaker representation sampling and diverse prompt selection strategies demonstrates a thoughtful approach to enhancing privacy while maintaining intelligibility and emotional fidelity. The dual-decoder architecture and the dynamic delay mechanism for latency management are particularly noteworthy, showcasing a sophisticated understanding of the trade-offs involved in real-time processing.
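One way to picture the pseudo-speaker construction is the hedged sketch below: average several pool speakers far from the source embedding, perturb, and renormalize. The pool, selection rule, and mixing weights are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of pseudo-speaker construction by embedding mixing.
import torch
import torch.nn.functional as F

def sample_pseudo_speaker(src_emb, pool, n_far=10, n_mix=4, noise_std=0.05):
    """src_emb: (d,) source speaker embedding; pool: (N, d) external speaker pool."""
    sims = F.cosine_similarity(pool, src_emb[None, :], dim=-1)
    far = torch.topk(-sims, n_far).indices                   # least similar pool speakers
    picks = far[torch.randperm(n_far)[:n_mix]]               # random subset for diversity
    mix = pool[picks].mean(dim=0) + noise_std * torch.randn_like(src_emb)
    return F.normalize(mix, dim=-1)                          # pseudo-speaker conditioning vector
```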
The experimental setup is robust, utilizing the VoicePrivacy 2024 Challenge protocol for evaluation, which lends credibility to the results. The paper reports significant improvements over the state-of-the-art method DarkStream, particularly in intelligibility and emotion preservation, while maintaining comparable latency. The comprehensive evaluation against both lazy-informed and semi-informed attackers provides a nuanced understanding of the system's privacy capabilities.
The paper provides detailed implementation specifics, including model configurations, training details, and datasets used, which enhances reproducibility. However, the lack of a publicly available code repository limits the ease with which other researchers can replicate the work.
One identified limitation is the relative degradation in performance against semi-informed attackers, indicating that while the system is effective, there is room for improvement in robustness against more sophisticated adversaries. Additionally, the reliance on specific datasets for training may limit the generalizability of the model to other domains or languages.
The implications of this work are significant, particularly in applications requiring speaker privacy, such as call centers, legal recordings, and medical conversations. The advancements in real-time speaker anonymization could lead to broader adoption of voice technologies in sensitive contexts, enhancing user trust and compliance with privacy regulations.
We present S$^2$Voice, the winning system of the Singing Voice Conversion Challenge (SVCC) 2025 for both the in-domain and zero-shot singing style conversion tracks. Built on the strong two-stage Vevo baseline, S$^2$Voice advances style control and robustness through several contributions. First, we integrate style embeddings into the autoregressive large language model (AR LLM) via FiLM-style layer-norm conditioning and style-aware cross-attention for enhanced fine-grained style modeling. Second, we introduce a global speaker embedding into the flow-matching transformer to improve timbre similarity. Third, we curate a large, high-quality singing corpus via an automated pipeline for web harvesting, vocal separation, and transcript refinement. Finally, we employ a multi-stage training strategy combining supervised fine-tuning (SFT) and direct preference optimization (DPO). Subjective listening tests confirm our system's superior performance: leading in style similarity and singer similarity for Task 1, and across naturalness, style similarity, and singer similarity for Task 2. Ablation studies demonstrate the effectiveness of our contributions in enhancing style fidelity, timbre preservation, and generalization. Audio samples are available at https://honee-w.github.io/SVC-Challenge-Demo/.
Primary: School of Software
All Institutions: School of Software
The main contribution of this paper is the introduction of S$^2$Voice, a novel system for singing style conversion that combines advanced modeling techniques and a curated dataset, achieving state-of-the-art performance in both in-domain and zero-shot tasks. The comprehensive methodology and rigorous experimental validation position this work as a significant advancement in the field of voice conversion and synthesis.
The methodology presented in S$^2$Voice is robust and innovative, utilizing a two-stage framework that effectively separates content and style modeling from acoustic rendering. The integration of FiLM-style layer normalization and style-aware cross-attention mechanisms enhances the model's ability to capture fine-grained style attributes, addressing core challenges in singing style conversion. The introduction of global speaker embeddings for timbre preservation is a significant advancement, as it mitigates style leakage during the acoustic modeling phase. The automated data curation pipeline also demonstrates a thorough approach to building a high-quality training dataset, which is critical for the model's performance.
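The FiLM-style conditioning can be sketched as a layer norm whose scale and shift are predicted from the style embedding, as below; the module name and its placement inside the AR LLM blocks are assumptions rather than the system's exact design.

```python
# Sketch of FiLM-style layer-norm conditioning on a style embedding.
import torch
import torch.nn as nn

class StyleFiLMLayerNorm(nn.Module):
    def __init__(self, dim, style_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * dim)

    def forward(self, h, style):                      # h: (B, T, dim), style: (B, style_dim)
        gamma, beta = self.to_scale_shift(style).chunk(2, dim=-1)
        # Per-channel scale and shift modulate the normalized hidden states.
        return (1 + gamma).unsqueeze(1) * self.norm(h) + beta.unsqueeze(1)
```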
The experimental evaluation is comprehensive, with subjective listening tests confirming the superiority of S$^2$Voice over the Vevo baseline across multiple metrics, including style similarity and naturalness. The ablation studies provide clear evidence of the contributions of each proposed method, validating the effectiveness of the new components introduced in the model. The use of both in-domain and zero-shot tasks in the evaluation further strengthens the findings, showcasing the model's generalization capabilities.
The paper provides sufficient detail on the model architecture, training procedures, and evaluation metrics to allow for reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors could enhance reproducibility by sharing their code and trained models.
One limitation noted is the marginal decrease in metrics observed after the direct preference optimization (DPO) stage, which, while improving perceptual quality, may indicate a trade-off between average performance metrics and the reduction of low-quality outputs. Additionally, the reliance on automated data curation may introduce biases or errors that could affect model performance.
The advancements made in singing style conversion have significant implications for various applications, including music production, entertainment, and personalized audio experiences. By improving the ability to convert singing styles while preserving timbre and naturalness, this research could enhance creative tools for musicians and content creators, leading to more expressive and diverse musical outputs.
Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.
Primary: Perm State Medical University
All Institutions: Perm State Medical University
The main contribution of this paper is the successful adaptation of transformer architectures for respiratory sound analysis, demonstrating significant improvements in diagnostic accuracy while integrating clinical context through multimodal approaches. This work not only advances the technical capabilities in the field but also emphasizes the importance of contextual data in medical diagnostics, potentially transforming how respiratory conditions are diagnosed and managed.
The paper presents a robust methodology by adapting the Audio Spectrogram Transformer (AST) for respiratory sound analysis, which is a novel application of transformer architectures in the medical domain. The integration of a Vision-Language Model (VLM) that combines audio spectrograms with structured patient metadata is particularly innovative, as it mirrors clinical workflows and enhances interpretability. The use of transfer learning and careful dataset management demonstrates a thoughtful approach to mitigating overfitting and ensuring the model's clinical relevance.
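A hedged sketch of the AST transfer-learning setup using the Hugging Face implementation is shown below; the checkpoint name and the three-class head are illustrative assumptions rather than the exact weights and label set used in the paper.

```python
# Hedged sketch of AST fine-tuning with the Hugging Face `transformers` AST classes.
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"          # public AudioSet-pretrained AST
extractor = ASTFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(
    ckpt, num_labels=3, ignore_mismatched_sizes=True      # e.g. asthma / other pathology / normal
)

waveform = torch.randn(16000 * 5).numpy()                  # placeholder 5 s recording at 16 kHz
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([0])
out = model(**inputs, labels=labels)                       # out.loss for fine-tuning, out.logits for screening
out.loss.backward()
```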
The experiments are well-structured, utilizing a controlled dataset that allows for a fair comparison between the AST and the DenseNet201 baseline. The results indicate a significant performance improvement with the AST achieving approximately 97% accuracy, which is a notable contribution to the field. The VLM's performance, while slightly lower, still shows promise in integrating multimodal data, highlighting the potential for future applications in clinical settings.
The paper provides sufficient detail regarding the dataset, model architecture, and training procedures, which supports reproducibility. However, the lack of publicly available code or a demo limits the ability for other researchers to replicate the findings directly. The authors do mention that the dataset is available upon request, which is a positive aspect for reproducibility.
One limitation is the reliance on a single-center dataset, which may affect the generalizability of the results. Additionally, while the AST shows high accuracy, the VLM's performance indicates that further refinement is needed to fully leverage multimodal data. The paper also does not address potential biases in the dataset or the implications of using demographic information in model predictions.
The findings have significant implications for the field of respiratory medicine, particularly in improving diagnostic accuracy through automated systems. The integration of multimodal data could enhance clinical decision-making processes and provide a more holistic view of patient health. This research could pave the way for further advancements in telemedicine and remote patient monitoring.
In the field of audio generation, signal-to-noise ratio (SNR) has long served as an objective metric for evaluating audio quality. Nevertheless, recent studies have shown that SNR and its variants are not always highly correlated with human perception, prompting us to raise two questions: Why does SNR fail to measure audio quality? And how can its reliability as an objective metric be improved? In this paper, we identify the inadequate measurement of phase distance as a pivotal factor and propose to reformulate SNR with specially designed phase-distance terms, yielding an improved metric named GOMPSNR. We further extend the newly proposed formulation to derive two novel categories of loss function, corresponding to magnitude-guided phase refinement and joint magnitude-phase optimization, respectively. In addition, extensive experiments are conducted to find an optimal combination of the different loss functions. Experimental results on advanced neural vocoders demonstrate that our proposed GOMPSNR exhibits more reliable error measurement than SNR. Meanwhile, our proposed loss functions yield substantial improvements in model performance, and our well-chosen combination of loss functions further improves overall model capability.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of GOMPSNR, a novel metric for audio quality assessment that improves upon traditional SNR by incorporating phase-distance considerations, along with the development of new loss functions that enhance audio generation performance. This work represents a meaningful advancement in the evaluation of audio quality, addressing a critical gap in existing methodologies and offering practical improvements for audio generation tasks.
The methodology presented in this paper is well-structured, focusing on the reformulation of the SNR metric to address its shortcomings in audio quality assessment. The introduction of GOMPSNR is innovative, as it incorporates phase-distance terms that are critical for accurate audio representation. The derivation of new loss functions based on this metric is a significant step forward, showcasing a thoughtful approach to integrating magnitude and phase information in audio generation tasks. The mathematical rigor in the derivation and the clear rationale for the proposed changes enhance the credibility of the methodology.
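Since the exact GOMPSNR definition belongs to the paper, the sketch below only illustrates the general idea it builds on: pair a waveform SNR with an anti-wrapping, magnitude-weighted phase-distance term computed on the STFT, so that phase errors are measured modulo 2π rather than as raw real/imaginary differences.

```python
# Illustrative sketch of an SNR-plus-phase-distance objective (not the paper's GOMPSNR formula).
import torch

def snr_db(ref, est, eps=1e-8):
    return 10 * torch.log10(ref.pow(2).sum() / ((ref - est).pow(2).sum() + eps))

def phase_distance(ref, est, n_fft=1024, hop=256):
    win = torch.hann_window(n_fft)
    R = torch.stft(ref, n_fft, hop, window=win, return_complex=True)
    E = torch.stft(est, n_fft, hop, window=win, return_complex=True)
    dphi = torch.angle(R) - torch.angle(E)
    wrapped = torch.atan2(torch.sin(dphi), torch.cos(dphi))   # principal value in (-pi, pi]
    mag = R.abs()
    return (mag * wrapped.abs()).sum() / (mag.sum() + 1e-8)   # magnitude-weighted phase error

def snr_with_phase(ref, est, lam=1.0):
    return snr_db(ref, est) - lam * phase_distance(ref, est)  # higher is better
```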
The experiments conducted are extensive and well-designed, utilizing established datasets such as LJSpeech and LibriTTS. The results demonstrate a clear improvement in correlation with perceptual metrics when using GOMPSNR compared to traditional SNR. The paper effectively showcases the performance of various vocoders under different loss function configurations, providing a comprehensive evaluation of the proposed methods. However, the lack of comparative results with other recent metrics could limit the contextual understanding of GOMPSNR's performance.
The paper provides sufficient detail regarding the implementation of the proposed methods, including the datasets used and the training settings. The availability of a GitHub repository for the project enhances reproducibility, allowing other researchers to validate the findings and explore the proposed methods further. However, more explicit details on hyperparameter tuning and specific configurations would aid in achieving full reproducibility.
One limitation of the study is the reliance on specific datasets, which may not fully represent the diversity of audio generation tasks. Additionally, while GOMPSNR shows promise, the paper does not extensively discuss its limitations or scenarios where it may not perform as well. The potential computational overhead introduced by the new loss functions is also not addressed, which could impact practical applications.
The proposed GOMPSNR metric and associated loss functions have the potential to significantly impact the field of audio generation by providing a more reliable and perceptually relevant measure of audio quality. This could lead to advancements in various applications, including speech synthesis, music generation, and audio restoration. The work encourages further exploration of phase information in audio processing, which may inspire future research directions.
Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS <= 0.08, SI-SNR >= 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
Primary: unknown
All Institutions: unknown
The paper presents a systematic study of audio-only adversarial attacks on trimodal audio-video-language models, revealing significant vulnerabilities and contributing to the understanding of adversarial robustness in multimodal systems. The innovative methodology and comprehensive experimental evaluation underscore its relevance and potential impact in the field of machine learning.
The paper proposes a systematic approach to audio-only adversarial attacks on trimodal models, introducing six distinct attack objectives targeting various stages of multimodal processing. The methodology is well-structured, leveraging gradient-based optimization to craft perturbations that can significantly degrade model performance without altering visual or textual inputs. This innovative focus on audio-only attacks in multimodal systems is a notable contribution, as it highlights vulnerabilities that have been largely overlooked in prior research.
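As a rough illustration of the attack family analysed here, the following sketch implements an untargeted, PGD-style perturbation applied only to the audio stream under an L-infinity budget. The `model(audio, video, text_ids)` interface, the `loss_fn` placeholder standing in for the six objectives, and the step sizes are assumptions for illustration, not the authors' implementation.

```python
import torch

def untargeted_audio_pgd(model, audio, video, text_ids, loss_fn,
                         eps=0.002, alpha=0.0005, steps=100):
    """Illustrative PGD-style attack that perturbs ONLY the audio stream.

    `model(audio, video, text_ids)` is a hypothetical trimodal forward pass;
    the objectives analysed in the paper (encoder features, cross-attention,
    hidden states, output likelihoods, ...) would each plug in as a different
    `loss_fn`. In real use, model parameters should be frozen so that only
    the perturbation receives meaningful gradients.
    """
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        out = model(audio + delta, video, text_ids)
        loss = loss_fn(out)                       # untargeted: push outputs away
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()    # gradient *ascent* on the loss
            delta.clamp_(-eps, eps)               # stay inside the L_inf budget
        delta.grad.zero_()
    return (audio + delta).detach()
```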
The experiments are comprehensive, evaluating the proposed attacks across three state-of-the-art models and multiple benchmarks. The results demonstrate high attack success rates and provide insights into the effectiveness of different attack strategies. The analysis of transferability and perceptual distortion adds depth to the evaluation, although the limited cross-model transferability observed raises questions about the generalizability of the findings.
While the paper provides detailed descriptions of the attack methodologies and evaluation metrics, it lacks specific implementation details and code availability, which could hinder reproducibility. The reliance on specific models and datasets also limits the ability to replicate results in other contexts.
The study primarily focuses on white-box attacks, which may not reflect real-world scenarios where adversaries have limited access to model internals. Additionally, the evaluation is restricted to three multimodal models, and the potential effectiveness of the attacks in physical-world settings remains untested. The authors also do not explore defenses against their attacks, which is a significant oversight.
This research has important implications for the security of multimodal systems, particularly in applications involving audio processing. Understanding the vulnerabilities exposed by audio-only adversarial attacks can inform the development of more robust models and defenses, ultimately enhancing the safety and reliability of AI systems that integrate multiple modalities.
Speech large language models (LLMs) have driven significant progress in end-to-end speech understanding and recognition, yet they continue to struggle with accurately recognizing rare words and domain-specific terminology. This paper presents a novel fine-tuning method, Reinforcement Learning with Biasing Rewards (RLBR), which employs a specialized biasing-word-preferred reward that explicitly emphasizes biasing words in the reward calculation. In addition, we introduce reference-aware mechanisms that extend the reinforcement learning algorithm with reference transcriptions to broaden the trajectory exploration space. Experiments on the LibriSpeech corpus across various biasing list sizes demonstrate that RLBR delivers substantial performance improvements over a strong supervised fine-tuning (SFT) baseline and consistently outperforms several recently published methods. The proposed approach achieves excellent performance on the LibriSpeech test-clean and test-other sets, reaching Biasing Word Error Rates (BWERs) of 0.59% / 2.11%, 1.09% / 3.24%, and 1.36% / 4.04% for biasing list sizes of 100, 500, and 1000, respectively, without compromising the overall WERs.
Primary: Microsoft Core AI
All Institutions: Microsoft Core AI
The paper presents RLBR, a novel fine-tuning approach for contextual biasing in speech large language models. The technical contribution is significant, with a well-defined methodology and promising experimental results that could impact the development of more accurate and context-aware speech recognition systems.
The proposed RLBR method innovatively applies reinforcement learning to enhance contextual biasing in speech LLMs, focusing on rare and domain-specific terminology. The introduction of a specialized reward function that prioritizes biasing words is a significant advancement over traditional methods. The reference-aware mechanisms further strengthen the exploration space, which is a thoughtful addition that addresses limitations in existing RL applications in speech recognition. The methodology is well-structured, with clear definitions and justifications for each component, although it could benefit from more detailed explanations of the ablation studies.
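To make the "biasing words preferred" idea concrete, the sketch below scores a hypothesis against a reference while up-weighting reference words that appear on the biasing list. The scoring rule, the `bias_weight` value, and the example strings are illustrative assumptions; the paper's RLBR reward and its reference-aware extensions are more elaborate.

```python
def biasing_weighted_reward(hyp, ref, biasing_words, bias_weight=3.0):
    """Illustrative reward that emphasises biasing words.

    Credits each recovered reference word, scaled by `bias_weight` when the
    word is on the biasing list. This only sketches the 'biasing words
    preferred' idea; it is not the paper's reward definition.
    """
    hyp_words = hyp.lower().split()
    score, total = 0.0, 0.0
    for w in ref.lower().split():
        weight = bias_weight if w in biasing_words else 1.0
        total += weight
        if w in hyp_words:
            score += weight
    return score / total if total else 0.0

# Toy example: the biasing word "oesler" was misrecognized, so the reward drops
# more sharply than a plain word-accuracy score would.
print(biasing_weighted_reward(
    "please page doctor osler now",
    "please page doctor oesler now",
    biasing_words={"oesler"}))
```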
The experiments conducted on the LibriSpeech corpus are robust, with comprehensive evaluations across various biasing list sizes. The reported results demonstrate substantial performance improvements over strong baselines, indicating the effectiveness of the RLBR approach. However, the paper could enhance its credibility by including comparisons with a wider range of existing methods and providing more detailed statistical analyses of the results.
The implementation details are well-documented, including the architecture, training procedures, and parameter settings. However, the absence of a shared code repository or demo URL limits the reproducibility of the results. Providing access to the code would significantly enhance the paper's impact and facilitate further research in this area.
While the paper presents a novel approach, it does not address potential limitations such as the scalability of the method to larger datasets or its performance in real-world applications outside the LibriSpeech corpus. Additionally, the reliance on specific architectures may limit the generalizability of the findings to other speech recognition systems.
The RLBR method has the potential to significantly improve speech recognition systems, particularly in applications requiring accurate transcription of rare or domain-specific terms. This advancement could enhance user experiences in various fields, including healthcare, legal, and technical domains, where precise terminology is crucial. The implications of this research extend to improving accessibility and usability of speech technologies for diverse user groups.
This paper targets a new scenario that integrates speech separation with speech compression, aiming to disentangle multiple speakers while producing discrete representations for efficient transmission or storage, with applications in online meetings and dialogue archiving. To address this scenario, we propose CodeSep, a codec-driven model that jointly performs speech separation and low-bitrate compression. CodeSep comprises a residual vector quantizer (RVQ)-based plain neural speech codec, a base-token disentanglement (BTD) module, and parallel auxiliary-token serial prediction (ATSP) modules. The BTD module disentangles mixed-speech mel-spectrograms into base tokens for each speaker, which are then refined by ATSP modules to serially predict auxiliary tokens, and finally, all tokens are decoded to reconstruct separated waveforms through the codec decoder. During training, the codec's RVQ provides supervision with permutation-invariant and teacher-forcing-based cross-entropy losses. As only base tokens are transmitted or stored, CodeSep achieves low-bitrate compression. Experimental results show that CodeSep attains satisfactory separation performance at only 1 kbps compared with baseline methods.
Primary: University of Science and Technology of China
All Institutions: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China
The main contribution of this paper is the introduction of CodeSep, a codec-driven model that effectively combines speech separation and low-bitrate compression, showcasing significant advancements in both methodology and practical applications. The innovative approach and comprehensive evaluation position this work as a valuable addition to the field of audio processing and machine learning.
The proposed CodeSep model introduces a novel architecture that integrates speech separation and low-bitrate compression through a codec-driven approach. The use of a residual vector quantizer (RVQ) for token representation and the innovative base-token disentanglement (BTD) and auxiliary-token serial prediction (ATSP) modules are significant contributions. The methodology effectively addresses the dual challenge of separating mixed speech while maintaining low bitrate for efficient transmission, which is a critical need in real-world applications like online meetings. The training strategy, including permutation-invariant cross-entropy loss for BTD and teacher-forcing for ATSP, demonstrates a thoughtful approach to tackle the inherent challenges of the task.
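The permutation-invariant supervision over codec tokens can be sketched as follows for the two-speaker case: cross-entropy is computed under both speaker assignments and the better permutation is kept. Tensor shapes, the vocabulary size, and the averaging scheme are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pit_token_cross_entropy(logits, targets):
    """Permutation-invariant CE over two speakers' token streams (a sketch).

    logits:  (B, 2, T, V)  per-speaker logits over a codec token vocabulary
    targets: (B, 2, T)     ground-truth base tokens per speaker
    Returns the cross-entropy under the better of the two speaker assignments.
    """
    B, S, T, V = logits.shape
    assert S == 2, "sketch handles the two-speaker case only"

    def ce(perm):
        l = logits[:, perm].reshape(B * S * T, V)
        t = targets.reshape(B * S * T)
        return F.cross_entropy(l, t, reduction="none").view(B, -1).mean(dim=1)

    loss_id, loss_swap = ce([0, 1]), ce([1, 0])
    return torch.minimum(loss_id, loss_swap).mean()

logits = torch.randn(4, 2, 50, 1024)
targets = torch.randint(0, 1024, (4, 2, 50))
print(pit_token_cross_entropy(logits, targets).item())
```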
The experiments utilize the Libri2Mix-clean dataset, which is appropriate for the task, and the evaluation metrics are well-chosen to assess both objective and subjective performance. The results indicate that CodeSep outperforms baseline methods significantly, achieving high-quality speech separation at a remarkably low bitrate of 1 kbps. The inclusion of both objective metrics (UTMOS, DNSMOS) and subjective assessments (NMOS, SMOS) strengthens the validity of the findings, providing a comprehensive evaluation of the model's performance.
The paper provides sufficient details on the architecture, training procedures, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the absence of a public code repository limits full reproducibility. The demo URL offers some audio samples, but a complete code release would enhance the reproducibility of the findings.
One limitation is the focus on a two-speaker scenario, which may not generalize well to cases with more speakers or different acoustic environments. Additionally, while the results are promising, the performance at higher bitrates could be further explored to understand the model's scalability and robustness in various conditions.
The integration of speech separation and compression has significant implications for applications in telecommunications, online conferencing, and archiving of spoken dialogues. By achieving high-quality separation at low bitrates, CodeSep can facilitate more efficient use of bandwidth and storage, which is increasingly important in the context of growing digital communication needs.
Immersive spatial audio has become increasingly critical for applications ranging from AR/VR to home entertainment and automotive sound systems. However, existing generative methods remain constrained to low-dimensional formats such as binaural audio and First-Order Ambisonics (FOA). Binaural rendering is inherently limited to headphone playback, while FOA suffers from spatial aliasing and insufficient high-frequency resolution. To overcome these limitations, we introduce ImmersiveFlow, the first end-to-end generative framework that directly synthesizes discrete 7.1.4-format spatial audio from stereo input. ImmersiveFlow leverages Flow Matching to learn trajectories from stereo inputs to multichannel spatial features within a pretrained VAE latent space. At inference, the latent features predicted by the Flow Matching model are decoded by the VAE and converted into the final 7.1.4 waveform. Comprehensive objective and subjective evaluations demonstrate that our method produces perceptually rich sound fields and enhanced externalization, significantly outperforming traditional upmixing techniques. Code implementations and audio samples are provided at: https://github.com/violet-audio/ImmersiveFlow.
Primary: Nanjing University
All Institutions: Nanjing University, The Chinese University of Hong Kong
The paper presents ImmersiveFlow, a pioneering framework for stereo-to-7.1.4 spatial audio generation, significantly advancing the field of immersive audio processing. The innovative use of flow matching within a VAE context demonstrates a strong potential for improving audio quality in various applications, marking a substantial contribution to the audio machine learning landscape.
The proposed ImmersiveFlow framework introduces a novel approach to generating high-channel spatial audio directly from stereo input using Conditional Flow Matching (CFM) and a pretrained Variational Autoencoder (VAE). This end-to-end generative model effectively addresses the limitations of traditional upmixing techniques by leveraging flow-based generative modeling to learn the mapping from stereo audio to immersive 7.1.4 audio. The methodology is well-structured, employing a transformer-based architecture for the velocity field, which enhances the model's ability to capture complex spatial audio distributions.
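A minimal sketch of the flow-matching objective in a latent space is given below, with a toy MLP standing in for the transformer velocity field and random tensors standing in for the VAE latents of the 7.1.4 target and the stereo condition. The rectified-flow interpolation path and all dimensions are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy stand-in for the transformer velocity field (shapes are assumptions)."""
    def __init__(self, latent_dim=64, cond_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, z_target, z_cond):
    """Rectified-flow style CFM loss: regress the straight-line velocity."""
    b = z_target.shape[0]
    t = torch.rand(b, 1)
    z0 = torch.randn_like(z_target)          # noise endpoint
    x_t = (1 - t) * z0 + t * z_target        # linear interpolation path
    v_target = z_target - z0                 # constant velocity along the path
    v_pred = model(x_t, t, z_cond)
    return ((v_pred - v_target) ** 2).mean()

model = VelocityField()
z_714 = torch.randn(8, 64)   # stands in for VAE latents of 7.1.4 audio
z_st = torch.randn(8, 64)    # stands in for stereo conditioning features
print(cfm_loss(model, z_714, z_st).item())
```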
The paper presents a comprehensive evaluation of ImmersiveFlow through both objective and subjective metrics. The dataset, while internally constructed, is robust, comprising 100 professionally mixed tracks, and the evaluation metrics are appropriate for assessing generative audio quality. The results indicate that ImmersiveFlow performs comparably to commercial upmixing tools while demonstrating superior generative capabilities, particularly in capturing spatial statistical structures.
The paper provides sufficient details regarding the implementation, including the architecture of the VAE and the flow matching model, as well as the training setup. The availability of code and audio samples on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work.
One limitation noted is the potential loss of high-frequency detail during the VAE encoding process, which may affect the overall quality of the generated audio. Additionally, the internal dataset, while extensive, may not encompass the full diversity of audio content found in real-world scenarios, potentially limiting the generalizability of the model.
The development of ImmersiveFlow has significant implications for various applications in immersive media, including AR/VR, home entertainment, and automotive sound systems. By enabling high-quality spatial audio generation from stereo sources, this work could enhance user experiences in these domains, paving the way for more realistic and engaging audio environments.
We introduce UNMIXX, a novel framework for multiple singing voices separation (MSVS). While related to speech separation, MSVS faces unique challenges: data scarcity and the highly correlated nature of singing voice mixtures. To address these issues, we propose UNMIXX with three key components: (1) a musically informed mixing strategy to construct highly correlated, music-like mixtures, (2) cross-source attention that drives the representations of the two singers apart via reverse attention, and (3) a magnitude penalty loss that penalizes erroneously assigned interfering energy. UNMIXX not only addresses data scarcity by simulating realistic training data, but also excels at separating highly correlated mixtures through cross-source interactions at both the architectural and loss levels. Our extensive experiments demonstrate that UNMIXX greatly enhances performance, with SDRi gains exceeding 2.2 dB over prior work.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology
UNMIXX presents a novel framework for multiple singing voices separation, effectively addressing challenges unique to this domain through innovative methodologies and promising experimental results. The contributions made by this paper are significant, particularly in enhancing the separation of highly correlated vocal mixtures, which is a critical area of research in audio processing.
The methodology presented in UNMIXX is innovative, particularly in its musically informed mixing strategy that simulates realistic training data for highly correlated singing voices. The introduction of cross-source attention is a significant advancement, as it allows for better separation of overlapping vocal signals by leveraging reverse attention mechanisms. The magnitude penalty loss is a thoughtful addition that addresses the challenge of energy misallocation in voice separation tasks. Overall, the framework is well-structured and addresses specific challenges in the domain of multiple singing voices separation.
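One plausible reading of the magnitude penalty is sketched below: estimated energy that falls in time-frequency bins dominated by the interfering singer is penalized. The masking rule and normalization are assumptions, not the loss as defined in the paper.

```python
import torch

def magnitude_penalty(est_mag, own_mag, other_mag, eps=1e-8):
    """Illustrative penalty on erroneously assigned interfering energy.

    est_mag   : (B, F, T) estimated magnitude spectrogram for one singer
    own_mag   : (B, F, T) ground-truth magnitude of that singer
    other_mag : (B, F, T) ground-truth magnitude of the *other* singer
    Penalises estimated energy in bins dominated by the interfering singer --
    one plausible reading of UNMIXX's magnitude penalty, not its definition.
    """
    interference_mask = (other_mag > own_mag).float()   # bins owned by the other singer
    leaked = interference_mask * est_mag
    return leaked.pow(2).sum() / (interference_mask.sum() + eps)

B, F_, T = 2, 513, 100
print(magnitude_penalty(torch.rand(B, F_, T),
                        torch.rand(B, F_, T),
                        torch.rand(B, F_, T)).item())
```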
The experiments conducted are extensive and demonstrate a clear improvement over previous methods, with SDRi gains exceeding 2.2 dB. The paper provides a thorough evaluation of the proposed method against baseline models, which adds credibility to the results. However, details regarding the datasets used for training and testing, as well as the specific metrics for evaluation, could be elaborated further to enhance clarity.
The paper lacks detailed implementation specifics that would allow for full reproducibility of the results. While the framework is described, the absence of a publicly available code repository or demo limits the ability of other researchers to replicate the findings. Including a project URL or supplementary materials would significantly improve this aspect.
One identified limitation is the reliance on simulated data for training, which may not fully capture the complexities of real-world singing voice mixtures. Additionally, the performance gains, while significant, may vary depending on the specific characteristics of the input mixtures, which could limit the generalizability of the approach.
The potential applications of UNMIXX extend beyond academic research into practical domains such as music production, karaoke systems, and voice-based entertainment technologies. By improving the separation of singing voices, this work could enhance user experiences in various audio applications, contributing to advancements in audio processing and machine learning in the creative industries.
Parameter-efficient fine-tuning (PEFT) is a scalable approach for adapting large speech foundation models to new domains. While methods such as LoRA and its state-of-the-art variants reduce adaptation costs, they typically allocate parameters uniformly across model subspaces, which limits their efficiency and scalability in speech applications. Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. SSVD-O combines input acoustic feature space-associated inner transformations with output semantic feature space-associated outer transformations to enable scalable and balanced adaptation. We conduct the first systematic analysis of parameter budget allocation across model subspaces in PEFT for automatic speech recognition (ASR), and investigate the trade-off between learning and forgetting under constrained resources. SSVD-O is benchmarked against LoRA, DoRA, PiSSA, and SSVD on domain-shifted ASR tasks, including child speech and regional accents, across model scales from 0.1B to 2B within the ESPnet framework. Experimental results show that SSVD-O consistently narrows the performance gap to full fine-tuning while improving generalization and mitigating catastrophic forgetting.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The paper introduces SSVD-O, a novel structured SVD-guided PEFT method that enhances adaptation efficiency in speech recognition tasks. The comprehensive analysis of its methodology, experimental validation, and implications for the field underscores its potential to advance the state-of-the-art in automatic speech recognition.
The proposed SSVD-O method innovatively combines inner and outer transformations to enhance parameter-efficient fine-tuning (PEFT) for speech recognition tasks. By leveraging structured singular value decomposition (SVD), the authors effectively address the limitations of existing methods like LoRA, DoRA, and PiSSA, particularly in the context of domain-shifted automatic speech recognition (ASR). The systematic analysis of parameter budget allocation across model subspaces is a significant methodological contribution, providing insights into the trade-offs between learning and forgetting.
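The flavour of SVD-guided adapter construction can be illustrated with the sketch below, which initializes trainable low-rank factors from the dominant singular subspace of a frozen weight and keeps the residual fixed, in the spirit of PiSSA/SSVD. SSVD-O's split into inner (acoustic) and outer (semantic) subspace transformations is not reproduced; the rank and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SVDGuidedAdapter(nn.Module):
    """Low-rank adapter initialised from the top singular subspace of W.

    A sketch of SVD-guided PEFT: the trainable factors A, B start from the
    dominant singular directions of the frozen weight, while the residual
    stays frozen. This is not SSVD-O itself.
    """
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        root_s = S[:rank].sqrt()
        self.A = nn.Parameter(U[:, :rank] * root_s)             # (out, r)
        self.B = nn.Parameter(root_s.unsqueeze(1) * Vh[:rank])   # (r, in)
        # Frozen residual: everything outside the top-r singular subspace.
        self.register_buffer("residual",
                             weight - self.A.detach() @ self.B.detach())

    def forward(self, x):
        return x @ (self.residual + self.A @ self.B).T

layer = SVDGuidedAdapter(torch.randn(512, 256), rank=8)
print(layer(torch.randn(4, 256)).shape)   # -> torch.Size([4, 512])
```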
The experiments are well-structured, benchmarking SSVD-O against several state-of-the-art PEFT methods across various model scales and domain-shifted tasks. The use of diverse datasets, including child speech and regional accents, strengthens the evaluation. The results demonstrate that SSVD-O outperforms existing methods and narrows the performance gap to full fine-tuning, showcasing its effectiveness in real-world applications.
The paper mentions the use of the ESPnet framework for implementation, which is a widely recognized toolkit in the speech processing community. However, the lack of specific URLs for code or datasets limits reproducibility. Detailed descriptions of the experimental setup and parameter configurations are provided, which aids in replicating the results.
One limitation is the potential complexity introduced by the dual transformation approach, which may require careful tuning of parameters. Additionally, while the paper addresses catastrophic forgetting, it does not explore the long-term implications of using SSVD-O in continual learning scenarios extensively. The focus on specific ASR tasks may also limit the generalizability of the findings to other domains.
The advancements presented in this paper have significant implications for the field of speech recognition, particularly in adapting large models to low-resource and domain-specific tasks. The methodology could be applied to various applications, including educational tools for children, assistive technologies for individuals with speech disorders, and regional dialect recognition systems. The findings promote more efficient use of computational resources in deploying large-scale models in practical settings.
Latest advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. For applicability in dynamic acoustic conditions with moving speakers, we propose to automate this rotary steering using an interleaved tracking algorithm conditioned on the target's initial direction. However, for nearby or crossing speakers, robust tracking becomes difficult and spatial cues less effective for enhancement. By incorporating the processed recording as additional guide into both algorithms, our novel joint autoregressive framework leverages temporal-spectral correlations of speech to resolve spatially challenging speaker constellations. Consequently, our proposed method significantly improves tracking and enhancement of closely spaced speakers, consistently outperforming comparable non-autoregressive methods on a synthetic dataset. Real-world recordings complement these findings in complex scenarios with multiple speaker crossings and varying speaker-to-array distances.
Primary: University of Hamburg
All Institutions: University of Hamburg
The main contribution of this paper is the development of a joint autoregressive framework for adaptive rotary steering that significantly improves the extraction of closely moving speakers in dynamic acoustic scenarios. This work addresses critical challenges in audio processing and has the potential to advance the field significantly.
The paper introduces an innovative joint autoregressive framework that enhances the robustness of speaker extraction in dynamic environments. The methodology effectively combines rotary steering with interleaved tracking, leveraging temporal-spectral correlations of speech. This approach is well-structured and addresses a significant gap in existing methods, particularly in scenarios with closely spaced or crossing speakers. The integration of processed recordings as additional guidance is a notable advancement that enhances the algorithm's adaptability to real-world conditions.
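The rotary-steering step itself can be illustrated independently of the tracker: a first-order Ambisonics frame is rotated so that the tracked target azimuth faces front before enhancement. The sketch below applies only a yaw rotation in ACN channel ordering and is a conceptual illustration, not the authors' pipeline.

```python
import numpy as np

def rotate_foa_to_azimuth(foa, target_azimuth_rad):
    """Rotate a first-order Ambisonics signal so the target faces front.

    foa: (4, T) channels in ACN order (W, Y, Z, X). Only a yaw rotation is
    applied, which is enough to illustrate steering the sound field toward a
    tracked direction of arrival.
    """
    c, s = np.cos(-target_azimuth_rad), np.sin(-target_azimuth_rad)
    # Yaw leaves W and Z untouched and mixes the X/Y dipole channels.
    rot = np.array([[1, 0, 0, 0],
                    [0, c, 0, s],
                    [0, 0, 1, 0],
                    [0, -s, 0, c]])
    return rot @ foa

foa = np.random.randn(4, 16000)
steered = rotate_foa_to_azimuth(foa, np.deg2rad(45.0))
print(steered.shape)   # -> (4, 16000)
```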
The experiments are comprehensive, utilizing both synthetic datasets and real-world recordings to validate the proposed method. The results demonstrate a clear improvement over non-autoregressive methods, showcasing the effectiveness of the joint autoregressive framework in complex scenarios. The evaluation metrics used are appropriate, and the comparisons with baseline methods are thorough, providing strong evidence for the claims made.
The paper lacks detailed implementation specifics, such as hyperparameters, training procedures, or code availability, which could hinder reproducibility. While the methodology is sound, the absence of a clear roadmap for replication may limit the community's ability to build upon this work.
One limitation is the reliance on initial direction of arrival (DoA) for tracking, which may not be robust in all scenarios, particularly in highly dynamic environments with rapid speaker movements. Additionally, the performance in extreme conditions or with a larger number of speakers has not been thoroughly explored, which could affect the generalizability of the findings.
The proposed method has significant implications for applications in real-time audio processing, such as in hearing aids, teleconferencing systems, and assistive technologies for the hearing impaired. By improving the extraction of closely moving speakers, this research could enhance communication clarity in complex auditory environments, contributing positively to accessibility and user experience.
Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. This paper introduces a novel multimodal speech enhancement framework that integrates bone-conduction sensors with air-conducted microphones using a conditional diffusion model. Our proposed model significantly outperforms previously established multimodal techniques and a powerful diffusion-based single-modal baseline across a wide range of acoustic conditions.
Primary: University of Hamburg
All Institutions: University of Hamburg
The main contribution of this paper is the introduction of BCDM, a novel multimodal speech enhancement framework that significantly improves performance in noisy environments by effectively integrating bone-conduction and air-conducted speech through a conditional diffusion model. This work represents a meaningful advancement in the field of audio signal processing, addressing critical challenges in speech enhancement and setting a foundation for future research in multimodal approaches.
The paper presents a novel approach to multimodal speech enhancement through the Bone-conduction Conditional Diffusion Model (BCDM). It effectively integrates bone-conduction and air-conducted speech modalities using two conditioning strategies (Input Concatenation and Decoder Conditioning) within a diffusion model framework. The methodology is well-structured, leveraging the strengths of both modalities while addressing the limitations of existing models. The use of a U-Net backbone and the detailed explanation of the conditioning strategies provide a solid foundation for the proposed enhancements.
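The Input Concatenation strategy is the simpler of the two and can be sketched with a toy network: the bone-conducted spectrogram is stacked as an extra input channel alongside the noisy air-conducted state at every reverse step. The architecture below is a deliberately tiny stand-in for the U-Net, and the omission of a time-step embedding is a simplification.

```python
import torch
import torch.nn as nn

class TinyScoreModel(nn.Module):
    """Toy stand-in for a diffusion denoiser with input-concatenation
    conditioning: the bone-conducted (BC) spectrogram is stacked as an extra
    channel next to the noisy air-conducted (AC) input. The real BCDM uses a
    U-Net and also explores decoder-side conditioning."""
    def __init__(self, channels=32):
        super().__init__()
        # 2 input channels = (noisy AC state, BC condition); a time-step
        # embedding would normally be injected as well but is omitted here.
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, x_t, bc_cond):
        return self.net(torch.cat([x_t, bc_cond], dim=1))

model = TinyScoreModel()
x_t = torch.randn(4, 1, 128, 64)   # noisy AC spectrogram patch
bc = torch.randn(4, 1, 128, 64)    # time-aligned bone-conduction condition
print(model(x_t, bc).shape)         # -> torch.Size([4, 1, 128, 64])
```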
The experimental setup is robust, utilizing the ABCS dataset for training and testing, which includes a substantial amount of time-aligned speech recordings. The comparison against multiple state-of-the-art baselines demonstrates the effectiveness of the proposed model across various noise conditions. The metrics used (POLQA, PESQ, ESTOI) are appropriate for evaluating speech quality and intelligibility, and the results convincingly show that BCDM outperforms existing models.
The paper provides sufficient details regarding the experimental setup, including dataset usage, model configurations, and training parameters, which facilitates reproducibility. However, the absence of a public demo or code repository limits immediate access for verification and experimentation by other researchers.
While the results are promising, the paper does not address potential limitations such as the computational complexity of the diffusion model, especially with respect to the number of reverse steps required for inference. Additionally, the performance in extremely noisy environments could be further explored to understand the model's robustness.
The proposed model has significant implications for real-world applications in speech enhancement, particularly in environments with high background noise, such as in telecommunications and assistive technologies for the hearing impaired. The integration of bone-conducted speech could lead to advancements in personal communication devices and enhance user experience in challenging acoustic conditions.
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
Primary: Defense Innovation Institute
All Institutions: Defense Innovation Institute, Academy of Military Sciences, Peking University, University of Electronic Science and Technology of China
The main contribution of this paper is the introduction of a novel AVSR framework that integrates speech enhancement without explicit noise masking, significantly improving robustness in noisy environments. This work represents a meaningful advancement in the field of audio-visual speech recognition, addressing critical challenges faced by existing methodologies and paving the way for future research in multimodal integration and noise-robust systems.
The proposed methodology introduces a novel "purification before fusion" paradigm that effectively integrates audio-visual speech recognition (AVSR) with speech enhancement. By employing a Conformer-based bottleneck fusion module, the authors eliminate the need for explicit noise masks, which is a significant departure from traditional methods that often discard relevant information alongside noise. The auxiliary audio-visual speech enhancement module is well-structured, using a combination of reconstruction and perceptual loss functions to ensure that the audio features are semantically rich and robust against noise. The end-to-end training approach enhances the model's efficiency and effectiveness, making it a compelling contribution to the field.
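The bottleneck-fusion idea can be sketched as a small set of learned tokens that read from the video stream and write back into the audio stream, so that cross-modal interaction is funnelled through a narrow channel. Plain multi-head attention replaces the paper's Conformer blocks here, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Sketch of bottleneck-token fusion between audio and video streams.

    A few learned tokens gather visual cues and the audio stream then attends
    to those tokens, so the modalities interact only through this bottleneck.
    The paper's module is Conformer-based; plain attention is used for brevity.
    """
    def __init__(self, dim=256, n_tokens=4, n_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.read_video = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.write_audio = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio, video):
        b = audio.shape[0]
        btk = self.tokens.unsqueeze(0).expand(b, -1, -1)
        btk, _ = self.read_video(btk, video, video)     # tokens read visual cues
        refined, _ = self.write_audio(audio, btk, btk)  # audio attends to tokens
        return audio + refined                          # residual refinement

fusion = BottleneckFusion()
audio = torch.randn(2, 100, 256)   # (batch, audio frames, dim)
video = torch.randn(2, 25, 256)    # (batch, video frames, dim)
print(fusion(audio, video).shape)   # -> torch.Size([2, 100, 256])
```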
The experiments are rigorously designed, utilizing the LRS3 dataset, which is a large-scale benchmark for audio-visual speech recognition. The results demonstrate a clear performance improvement over existing mask-based methods, particularly in noisy conditions. The evaluation metrics, including word error rate (WER), are appropriate for the task, and the authors provide comprehensive comparisons against multiple state-of-the-art baselines. The ablation studies further substantiate the effectiveness of the proposed enhancements, particularly the impact of bottleneck tokens and the dual loss functions.
The paper provides sufficient implementation details, including model architectures, training protocols, and hyperparameter settings. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider releasing the code to facilitate further research and validation.
One limitation of the study is the reliance on a specific dataset (LRS3), which may not fully represent all possible real-world scenarios. Additionally, while the proposed method shows promise, its performance in extremely noisy or diverse environments remains to be thoroughly evaluated. The computational efficiency of the model, while improved, still needs to be assessed in practical applications.
The implications of this research extend to various applications in speech recognition, particularly in environments with significant background noise, such as public spaces or during teleconferencing. The ability to enhance speech recognition through visual cues could greatly benefit accessibility technologies, human-computer interaction, and automated transcription services.
The ambiguity of human emotions poses several challenges for machine learning models, as they often overlap and lack clear delineating boundaries. Contrastive language-audio pretraining (CLAP) has emerged as a key technique for generalisable emotion recognition. However, as conventional CLAP enforces a strict one-to-one alignment between paired audio-text samples, it overlooks intra-modal similarity and treats all non-matching pairs as equally negative. This conflicts with the fuzzy boundaries between different emotions. To address this limitation, we propose SmoothCLAP, which introduces softened targets derived from intra-modal similarity and paralinguistic features. By combining these softened targets with conventional contrastive supervision, SmoothCLAP learns embeddings that respect graded emotional relationships, while retaining the same inference pipeline as CLAP. Experiments on eight affective computing tasks across English and German demonstrate that SmoothCLAP consistently achieves superior performance. Our results highlight that leveraging soft supervision is a promising strategy for building emotion-aware audio-text models.
Primary: Munich Center for Machine Learning (MCML)
All Institutions: Munich Center for Machine Learning (MCML), Munich Data Science Institute (MDSI), Reliable AI (relAI)
The main contribution of this paper is the introduction of SmoothCLAP, a novel framework that enhances contrastive language-audio pretraining for emotion recognition by incorporating softened targets derived from intra-modal similarities. This advancement represents a meaningful step towards more accurate and flexible emotion-aware models, addressing the inherent complexities of human emotional expression.
The proposed SmoothCLAP framework innovatively integrates computational paralinguistic features into the CLAP training process by introducing softened targets that account for intra-modal similarities. This approach addresses the limitations of conventional contrastive learning methods that enforce strict one-to-one alignments, thereby enabling the model to capture nuanced emotional relationships. The methodology is well-structured, clearly outlining the architecture and the rationale behind the soft-target supervision, which enhances the model's ability to generalize across different emotional contexts.
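A compact sketch of soft-target contrastive supervision is given below: the usual one-hot matching targets are blended with a softened similarity distribution derived from intra-modal (here, paralinguistic) features. The blending weight, temperatures, and the choice of cosine similarity are illustrative assumptions rather than SmoothCLAP's exact target construction.

```python
import torch
import torch.nn.functional as F

def soft_clap_loss(audio_emb, text_emb, para_feats,
                   tau=0.07, smooth_tau=0.2, lam=0.5):
    """Sketch of contrastive training with softened targets.

    Hard targets are the usual identity matching of paired audio/text; soft
    targets come from intra-modal similarity of paralinguistic features, so
    emotionally similar non-pairs are not treated as fully negative.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / tau                             # (B, B)

    hard = torch.eye(len(a), device=a.device)
    p = F.normalize(para_feats, dim=-1)
    soft = F.softmax(p @ p.T / smooth_tau, dim=-1)     # graded similarities
    targets = lam * hard + (1 - lam) * soft
    targets = targets / targets.sum(dim=-1, keepdim=True)

    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

B, D = 8, 128
print(soft_clap_loss(torch.randn(B, D), torch.randn(B, D),
                     torch.randn(B, 32)).item())
```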
The experiments are comprehensive, covering eight affective computing tasks in both English and German, which demonstrates the robustness and versatility of the proposed method. The results indicate that SmoothCLAP consistently outperforms existing models, particularly in zero-shot scenarios, showcasing its potential for broader applications beyond emotion recognition. The use of diverse datasets strengthens the validity of the findings, although the paper acknowledges some limitations in performance across specific tasks.
The paper provides sufficient details regarding the experimental setup, including dataset descriptions, model architectures, and training parameters. However, the absence of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should consider making the implementation available to facilitate validation by the research community.
One notable limitation is the model's underperformance on certain tasks, which the authors attribute to the "no free lunch" theorem. Additionally, the reliance on specific datasets may introduce biases, and the influence of training data diversity on model performance remains an open question. The paper could benefit from a deeper exploration of these aspects.
The integration of soft-target supervision in emotion recognition models has significant implications for affective computing, particularly in applications such as human-computer interaction, mental health monitoring, and personalized content delivery. By improving the model's ability to understand and interpret human emotions, this research could lead to more empathetic AI systems.
Learning representative embeddings for different types of speaking styles, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In this work, we introduce ParaMETA, a unified and flexible framework for learning and controlling speaking styles directly from speech. Unlike existing methods that rely on single-task models or cross-modal alignment, ParaMETA learns disentangled, task-specific embeddings by projecting speech into dedicated subspaces for each type of style. This design reduces inter-task interference, mitigates negative transfer, and allows a single model to handle multiple paralinguistic tasks such as emotion, gender, age, and language classification. Beyond recognition, ParaMETA enables fine-grained style control in Text-To-Speech (TTS) generative models. It supports both speech- and text-based prompting and allows users to modify one speaking style while preserving others. Extensive experiments demonstrate that ParaMETA outperforms strong baselines in classification accuracy and generates more natural and expressive speech, while maintaining a lightweight and efficient model suitable for real-world applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ParaMETA, a novel framework that effectively learns and controls disentangled paralinguistic speaking styles from speech, addressing critical challenges in both recognition and generative tasks. This work significantly advances the field by providing a structured approach to modeling speaking styles, which enhances both the interpretability and performance of speech processing systems.
The proposed methodology of ParaMETA is innovative in its approach to disentangling paralinguistic speaking styles by utilizing a two-stage embedding learning strategy. This framework allows for the projection of speech into task-specific subspaces, which effectively reduces inter-task interference and negative transfer. The use of supervised contrastive learning and the introduction of prototype embeddings for class representation are significant contributions that enhance the model's ability to learn and control speaking styles. The methodology is well-structured and addresses existing limitations in related works, such as CLAP and UniStyle, by providing a more flexible and interpretable framework.
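The subspace idea can be sketched as dedicated linear projections of a shared speech embedding, each trained with a supervised contrastive loss on its own labels; see below. The projector shapes, task list, and loss details are assumptions, and ParaMETA's prototype embeddings and two-stage schedule are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskSubspaceProjector(nn.Module):
    """Sketch of per-task projections over a shared speech embedding, so that
    emotion, gender, age, ... live in separate subspaces and interfere less."""
    def __init__(self, shared_dim=768, task_dim=128,
                 tasks=("emotion", "gender", "age", "language")):
        super().__init__()
        self.proj = nn.ModuleDict({t: nn.Linear(shared_dim, task_dim) for t in tasks})

    def forward(self, shared_emb):
        return {t: F.normalize(p(shared_emb), dim=-1) for t, p in self.proj.items()}

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss within a single task subspace."""
    n = z.shape[0]
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = (z @ z.T / tau).masked_fill(eye, -1e9)        # drop self-similarity
    log_prob = F.log_softmax(sim, dim=-1)
    pos = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    denom = pos.sum(dim=-1).clamp(min=1.0)              # no-positive rows contribute 0
    return -(pos * log_prob).sum(dim=-1).div(denom).mean()

model = TaskSubspaceProjector()
subspaces = model(torch.randn(16, 768))
emotion_labels = torch.randint(0, 4, (16,))
print(supcon_loss(subspaces["emotion"], emotion_labels).item())
```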
The experiments conducted are extensive and well-designed, utilizing a multilingual and multi-style speech dataset that covers a broad range of speaking styles. The results demonstrate that ParaMETA consistently outperforms several strong baselines in both classification accuracy and generative tasks. The evaluation metrics used, including classification accuracy and perceptual quality in TTS generation, provide a comprehensive assessment of the framework's performance. The ablation studies further validate the effectiveness of the proposed methods, showcasing the robustness of the model across different architectures.
The paper provides sufficient details regarding the experimental setup, including the dataset construction, training procedures, and evaluation metrics, which facilitate reproducibility. However, the lack of specific institutional affiliations and the absence of a demo URL may hinder broader accessibility for researchers looking to replicate the results.
While the framework shows promising results, there are limitations in terms of the potential computational demands associated with training large models. The paper also does not address the scalability of the approach to even larger datasets or more diverse speaking styles beyond those tested. Additionally, the reliance on a specific set of datasets may limit the generalizability of the findings.
ParaMETA has significant implications for applications in affective computing, human-computer interaction, and generative speech synthesis. The ability to control speaking styles in TTS systems can enhance user experiences in various domains, including virtual assistants and conversational agents. The framework's focus on disentangled representations could lead to advancements in personalized AI systems that respond more effectively to human emotions and characteristics.
Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short, fixed-duration inputs, which constrains their use in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
Primary: unknown
All Institutions: unknown
The SLAP framework addresses critical limitations in existing CLAP models by scaling pretraining to 109 million audio-text pairs, supporting variable-length audio inputs, and unifying multiple training objectives in a single-stage pipeline. These changes yield richer dense audio representations and new state-of-the-art results on audio-text retrieval and zero-shot audio classification.
The SLAP framework introduces significant innovations in the architecture of the audio encoder and the training methodology. The redesigned Transformer architecture effectively captures audio-specific characteristics and supports variable-length inputs, which is a notable advancement over existing CLAP models. The unified training pipeline that combines multiple objectives in a single-stage training is a practical approach that simplifies the training process while enhancing model performance. However, the paper could benefit from more detailed explanations of the architectural choices and their empirical justifications.
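The single-stage, multi-objective combination can be sketched as one weighted sum over a symmetric contrastive term, a self-supervised reconstruction term, and a token-level captioning term computed from the same batch. The component definitions and loss weights below are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def slap_style_losses(audio_emb, text_emb, masked_pred, masked_target,
                      caption_logits, caption_ids, tau=0.07,
                      w_con=1.0, w_ssl=0.5, w_cap=0.5):
    """Sketch of a single-stage multi-objective combination: contrastive +
    self-supervised + captioning terms with fixed weights (all illustrative)."""
    # Symmetric InfoNCE over paired audio/text embeddings.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / tau
    labels = torch.arange(len(a), device=a.device)
    l_con = 0.5 * (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels))

    # Self-supervised term, e.g. regression of masked dense audio features.
    l_ssl = F.mse_loss(masked_pred, masked_target)

    # Captioning term: token-level CE of the decoder over the reference caption.
    l_cap = F.cross_entropy(caption_logits.transpose(1, 2), caption_ids)

    return w_con * l_con + w_ssl * l_ssl + w_cap * l_cap

B, D, T, V = 8, 256, 20, 5000
loss = slap_style_losses(torch.randn(B, D), torch.randn(B, D),
                         torch.randn(B, 50, D), torch.randn(B, 50, D),
                         torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(loss.item())
```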
The experiments are comprehensive, covering a variety of tasks such as audio-text retrieval, zero-shot audio classification, and audio captioning. The results demonstrate state-of-the-art performance across multiple benchmarks, indicating the effectiveness of the proposed model. However, the reliance on a dataset with generated captions raises questions about the quality of the training data and its impact on the model's generalization capabilities.
The paper provides sufficient details regarding the architecture and training process, including hyperparameters and training protocols. However, the absence of a publicly available dataset and code repository limits reproducibility. Future work should consider releasing these resources to facilitate validation of the results.
The primary limitation lies in the use of generated captions for training, which may introduce noise and affect the model's robustness. Additionally, while the model shows strong performance, it may not generalize well to tasks outside the evaluated benchmarks. The computational efficiency of the proposed methods, while improved, could still be a concern for practical applications involving longer audio inputs.
The SLAP framework has the potential to advance the field of audio representation learning significantly, particularly in applications such as audio classification, retrieval, and captioning. Its scalability and ability to handle variable-duration audio make it suitable for real-world applications, potentially improving accessibility in multimedia content creation and audio analysis.